ABSTRACT

Many systems for big data analytics employ a data flow abstrac- 
tion to define parallel data processing tasks. In this setting, custom 
operations expressed as user-defined functions are very common. 
We address the problem of performing data flow optimization at 
this level of abstraction, where the semantics of operators are not 
known. Traditionally, query optimization is applied to queries with 
known algebraic semantics. In this work, we find that a handful 
of properties, rather than a full algebraic specification, suffice to 
establish reordering conditions for data processing operators. We 
show that these properties can be accurately estimated for black 
box operators by statically analyzing the general-purpose code of 
their user-defined functions. 

We design and implement an optimizer for parallel data flows 
that does not assume knowledge of semantics or algebraic proper- 
ties of operators. Our evaluation confirms that the optimizer can 
apply common rewritings such as selection reordering, bushy join- 
order enumeration, and limited forms of aggregation push-down, 
hence yielding similar rewriting power as modern relational DBMS 
optimizers. Moreover, it can optimize the operator order of non- 
relational data flows, a unique feature among today's systems. 

1. INTRODUCTION

We are witnessing a data explosion in a variety of domains, in- 
cluding large-scale scientific data collection from various sensors, 
user-generated data, and data resulting from tracking human behav- 
ior online or otherwise. For example, the Large Hadron Collider at 
CERN generates around 15 petabytes per year [1], and the LSST 
telescope is expected to generate about 0.5 petabytes per month 
when it becomes operational [8]. Similar data volumes are ex- 
pected to be created by next-generation DNA sequencing technolo- 
gies [6]. It is now widely believed that a number of future scientific 
breakthroughs will be empowered by the ability to quickly analyze 
vast amounts of data. Similarly, the competitive advantage of many 
enterprises that operate on a web scale critically depends on draw- 
ing insights from huge data sets. 

During the last years, it became clear that relational DBMSs
could not cope with the scale and the nature of today's big data
problems. This is due to a variety of reasons, including obsolete
architectures [30], and trying to "fit" new problems to the relational
model of programming. In 2004, Google reported their
results on analyzing 100 terabytes of (mostly unstructured) data 
per day using their MapReduce framework [17], a number that 
grew to 20 petabytes per day in 2008 [18]. Partly motivated by 
these breakthroughs, new big data analysis systems have emerged 
to serve the aforementioned needs. Many of these systems such as 
Hyracks [11], Dryad [25], and our own Stratosphere system [7] 
adopt a data flow abstraction, where a data analysis program is 
specified as a directed acyclic graph (DAG) of smaller components 
that contain arbitrary user code. Even though some of these sys- 
tems offer higher-level language interfaces [10, 12, 28, 31], sup- 
porting parallel user-defined functions (UDFs) is a fundamental re- 
quirement for these systems. Recently, commercial parallel DBMSs 
such as Aster Data and Greenplum have adopted MapReduce-style 
UDFs [2, 20] to explore a wider scope of applications. 

The common challenge faced by these systems is to efficiently 
execute parallel data flows that embed UDFs. This entails paral- 
lelization, as well as reordering of operators. These two problems 
are highly coupled, as the optimal parallelization strategy depends 
on the operator order and vice versa. Traditional RDBMS opti- 
mizers support only UDFs that follow very strict templates such 
as scalar, aggregation, and table-generator UDFs. Due to these 
strict templates, the main challenge for RDBMS optimizers is not 
whether UDFs can be reordered but rather when it is beneficial. In 
contrast, MapReduce-style UDFs implement much less restrictive 
templates and hide their semantics inside general-purpose impera- 
tive code, a fact that poses new challenges for optimization. Con- 
ventional wisdom dictates that query optimization is possible at an 
abstraction layer where the semantics and the algebraic properties 
of operators are known. In this work, we build a query optimizer 
that does not require this assumption. Rather, our optimizer per- 
forms a fully automatic static code analysis pass over the UDFs, 
discovering a handful of properties that guarantee safe reorderings. 
We observe that a few properties, rather than knowledge of full 
semantics, are enough to enable many optimizations, including se- 
lection and join reordering, as well as limited forms of aggregation 
push-down. 

The contributions of this paper are the following: 

1. We introduce the problem of reordering data flow programs 
that consist of arbitrary imperative user-defined functions. 

2. We formally establish the necessary conditions to reorder UDFs 
with a fixed signature (e. g., Map and Reduce) in a data flow. 

3. We show how to derive the necessary knowledge for reordering
via a static code analysis pass over the imperative UDF
implementations.

4. We design and implement a query optimizer for this setting. In
particular, we present a novel plan enumeration algorithm that 
does not use algebraic properties. 

5. We implement the above concepts in the Stratosphere system [4], 
and conduct an extensive experimental study. 

6. Our experimental results show that we can reproduce most re- 
orderings done by traditional query optimizers in relational qu- 
eries such as join and selection reordering and some forms of 
aggregation push-down. Further, our system can automatically 
find optimal plans for non-relational tasks without being in- 
formed a priori about the semantics of the operators. 

While we present our optimizer in the context of the Stratosphere 
system, the results presented in this paper are applicable to a variety 
of parallel data flow systems that use imperative UDFs. 

The remainder of this paper is organized as follows. Section 2 
presents background material on Stratosphere's architecture, data 
model, and programming model. Section 3 introduces the problem 
by means of an example, and outlines the salient points of our so- 
lution. Section 4 delves into the details, and presents formal proofs 
for rewriting operators. Section 5 shows how to derive the infor- 
mation required by the optimizer using static code analysis. Sec- 
tion 6 presents the design of our query optimizer, including the plan 
enumeration algorithm. Section 7 presents our experimental study. 
Finally, Section 8 presents related work, and Section 9 concludes 
and offers research directions. 

2. BACKGROUND: STRATOSPHERE 

2.1 System Architecture 

The Stratosphere system consists of two distinct components: 
The Nephele execution engine [7, 32], and the PACT compiler [7]. 
The user writes data analysis tasks in Java by providing first-order 
functions for a fixed set of second-order functions called Paral- 
lelization Contracts (PACTs, see Section 2.3 for details). The PACT 
compiler is responsible for translating the user-defined program 
into an efficient DAG data flow program, which is then deployed 
and executed by the Nephele engine. During compilation, the PACT 
compiler can exploit some declarative aspects of the PACT program 
in order to make cost-based decisions similar to a relational DBMS 
query optimizer, i.e., it decides on data shipping and local execu- 
tion strategies for operators [7]. For example, the PACT compiler 
chooses between a partitioning, replication, or combined strategy 
for a parallel join (which is specified using the Match second-order 
function in the PACT programming model). The work described 
in this paper enables the PACT compiler to reorder operators in the 
data flow, in addition to choosing parallelization strategies. 

2.2 Data Model 

Stratosphere has recently migrated from a key-value pair data 
model to a record data model. The reasons for the new data model 
are twofold: First, it increases end-user productivity by allowing 
the programmer to work with more structured data rather than co- 
alescing the data to a single value at every step of the data analysis 
program. Second, the new data model exposes more knowledge 
about the data analysis task to the compiler, making several new 
optimizations possible. For example, the optimizations presented 
in this paper would be rather limited if a simple key-value data 
model was used. 

We define a data set as an unordered list of records, and denote
it by $D = [r_1, \ldots, r_n]$. A record is an ordered tuple of values,
$r = (v_1, \ldots, v_m)$. The semantics of the values, including their
type, is left to the user-defined functions that manipulate them. We
define two data sets $D_1, D_2$ as equal (denoted as $D_1 \equiv D_2$)
when there exist two orderings of their records, such that
$D_1 = [r_{11}, \ldots, r_{1n}]$, $D_2 = [r_{21}, \ldots, r_{2m}]$,
$n = m$ and $\forall i = 1, \ldots, n : r_{1i} = r_{2i}$.
Two records $r_1 = (v_{11}, \ldots, v_{1n})$ and
$r_2 = (v_{21}, \ldots, v_{2m})$ are equal ($r_1 = r_2$) iff $n = m$
and $\forall i = 1, \ldots, n : v_{1i} = v_{2i}$.
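
As an illustration of this definition, the following minimal Python sketch treats a data set as a list of tuples and compares two data sets as multisets. The function name `datasets_equal` is hypothetical and not part of Stratosphere; the sketch also assumes that all field values are hashable.

```python
from collections import Counter

def datasets_equal(d1, d2):
    """Order-insensitive equality of two data sets (lists of tuples).

    Two data sets are equal if some ordering of their records makes them
    pairwise equal, i.e. if they are equal as multisets. Tuple equality
    already requires equal arity and pairwise equal values.
    """
    return Counter(d1) == Counter(d2)
```

Duplicates matter under this definition: a data set that contains a record twice is not equal to one that contains it once.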

2.3 Programming Model 

The PACT programming model [7] is a generalization of the 
MapReduce programming model [17]. A PACT program is a di- 
rected acyclic data flow composed of data sources, data sinks, and 
operators. An operator consists of a second-order function and 
an associated first-order user-defined function (UDF). In addition, 
some second-order functions require the specification of special 
(possibly composite) "key" fields. The first-order UDF can emit 
an arbitrary number of output records per invocation, possibly with 
modified value types. The second-order function defines how the 
input data set is partitioned into groups and applies the first-order 
function to each group independently. Hence, groups are processed 
in a data-parallel fashion possibly on different nodes without in- 
curring communication overhead. Thereby, the type of the second- 
order function defines the parallelization opportunities for a given 
operator. 

There are currently five second-order functions (called PACTs) 
implemented in Stratosphere: Map, Reduce, Cross, Match, and 
CoGroup (see Figure 1). The Map function dictates that every in- 
put record forms an individual group. The Reduce function dictates 
that a group exists for every unique value of the key attribute in the 
input data set, and contains all records with the particular key value. 
The Cross, Match, and CoGroup second-order functions are used 
to define binary operators. The Cross function forms a group from 
every pair of records in its two inputs, similarly to forming a dis- 
tributed Cartesian product of two sets. The Match function forms 
a group from every pair of records in its two inputs, only if the 
records have the same value for the key attribute. Match is there- 
fore similar to an equi-join. Finally, the CoGroup function forms a 
group for every value of the key attribute (from the domains of both 
inputs), and places each record in the appropriate group depending 
on the key value of the record. 

More formally, assume two input data sets $R = [r_1, \ldots, r_N]$
and $S = [s_1, \ldots, s_M]$. The Map PACT is defined as

$$\mathrm{Map} : R \times f \to [f(r_1), \ldots, f(r_N)]$$

where $f$ is the user-defined first-order function of Map. The Reduce
function is defined as

$$\mathrm{Reduce} : R \times f \times K \to [f(r^{k_1}_1, \ldots, r^{k_1}_{n_1}), \ldots, f(r^{k_l}_1, \ldots, r^{k_l}_{n_l})]$$

where $K$ is a set of attributes of $R$ called the key, the active domain
of $K$ in $R$ is $\{k_1, \ldots, k_l\}$, and for record $r^k$ it holds
that $r.K = k$. Note that the UDF $f$ of a Reduce function operates on a
list of input records. The Cross and Match functions are defined as

$$\mathrm{Cross} : R \times S \times f \to [f(r_1, s_1), f(r_1, s_2), \ldots, f(r_N, s_M)]$$

$$\mathrm{Match} : R \times S \times K \times F \times f \to [\{f(r, s) \mid r.K = s.F\}]$$

where $K$ and $F$ are the keys of the Match function for $R$ and $S$
respectively. The CoGroup function is defined as

$$\mathrm{CoGroup} : R \times S \times K \times F \times f \to [f(r^{v_1}_1, \ldots, r^{v_1}_{n_1}, s^{v_1}_1, \ldots, s^{v_1}_{m_1}), \ldots, f(r^{v_l}_1, \ldots, r^{v_l}_{n_l}, s^{v_l}_1, \ldots, s^{v_l}_{m_l})]$$

where the combined active domain of $K$ and $F$ is $\{v_1, \ldots, v_l\}$.
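
The five second-order functions can be sketched sequentially in Python as follows. This is only an illustration of the grouping semantics, not the Stratosphere API: the names `pact_map`, `pact_reduce`, and so on are hypothetical, keys are passed as extractor functions, and each UDF is assumed to return a list of output records.

```python
from collections import defaultdict
from itertools import product

def pact_map(R, f):
    # Every input record forms its own group.
    return [o for r in R for o in f(r)]

def pact_reduce(R, f, key):
    # One group per distinct key value; f sees the whole group.
    groups = defaultdict(list)
    for r in R:
        groups[key(r)].append(r)
    return [o for g in groups.values() for o in f(g)]

def pact_cross(R, S, f):
    # One group per pair of records (Cartesian product).
    return [o for r, s in product(R, S) for o in f(r, s)]

def pact_match(R, S, key_r, key_s, f):
    # One group per pair of records with equal keys (equi-join).
    return [o for r, s in product(R, S)
            if key_r(r) == key_s(s) for o in f(r, s)]

def pact_cogroup(R, S, key_r, key_s, f):
    # One group per key value from either input; f sees both record lists.
    gr, gs = defaultdict(list), defaultdict(list)
    for r in R:
        gr[key_r(r)].append(r)
    for s in S:
        gs[key_s(s)].append(s)
    keys = list(dict.fromkeys(list(gr) + list(gs)))
    return [o for k in keys for o in f(gr.get(k, []), gs.get(k, []))]
```

In a parallel engine each group could be processed on a different node; the sketch only fixes which records a UDF invocation sees together.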

Figure 1: (a) Map, (b) Reduce, (c) Cross, (d) Match, and (e) CoGroup second-order functions.

We distinguish between PACTs whose UDF is called with exactly
one record per input (Map, Match, and Cross) as argument
and PACTs whose UDF is called with a list of records per input
(Reduce and CoGroup). We call the former record-at-a-time (RAT)
operators, and the latter key-at-a-time (KAT) operators. For the
latter, we refer to all input records of data set $D$ with a specific key
value $k$ as a key group $D_k$.

3. A REORDERING EXAMPLE 

We address the concrete problem of optimizing PACT programs, 
in which the algebraic properties of first-order functions are not 
known. Our solution proceeds in three steps: First, in Section 4, 
we establish the necessary conditions to reorder PACT operators. 
At this stage, we treat the UDFs of operators as black boxes. Our 
key insight is that a few properties, rather than full semantics, suf- 
fice to establish many reordering conditions. Next, in Section 5, we 
show how to safely approximate these properties, by "opening" the 
black box operators via a static code analysis pass over their code. 
Finally, in Section 6, we show how to enumerate plans when the 
concept of algebraic expressions does not apply. We first demon- 
strate the salient points of our complete solution with an example. 

Assume a PACT program $P$ that consists of three Map operators
with first-order functions $f_1$, $f_2$, and $f_3$ interconnected as follows:

$$P : I \to \mathrm{Map}_1 \to \mathrm{Map}_2 \to \mathrm{Map}_3 \to O$$

The input data set $I$ contains two integer attributes $(A, B)$. The first
function $f_1$ replaces $B$ with $|B|$. The second function $f_2$ emits all
records for which $A \geq 0$ and filters the rest of the records, and the
third function $f_3$ replaces $A$ with the sum $A + B$. For example,
with input record $i = (2, -3)$, the data flow is

$$(2, -3) \to f_1 \to (2, 3) \to f_2 \to (2, 3) \to f_3 \to (5, 3)$$

while with input record $i' = (-2, -3)$ the data flow is

$$(-2, -3) \to f_1 \to (-2, 3) \to f_2 \to \bot \to f_3 \to \bot$$

where $\bot$ represents the empty list.

Consider now the alternative plan $P'$ where the order of $\mathrm{Map}_1$
and $\mathrm{Map}_2$ is inverted:

$$P' : I \to \mathrm{Map}_2 \to \mathrm{Map}_1 \to \mathrm{Map}_3 \to O$$

The data flow for records $i$ and $i'$ is

$$(2, -3) \to f_2 \to (2, -3) \to f_1 \to (2, 3) \to f_3 \to (5, 3)$$

$$(-2, -3) \to f_2 \to \bot \to f_1 \to \bot \to f_3 \to \bot$$

Observe that the order of $\mathrm{Map}_1$ and $\mathrm{Map}_2$ does not
influence the output data set $O$. Therefore, for input $I = [i, i']$, these
two operators can be safely reordered. In fact, if $f_2$ filters a
significant portion of the records in $I$, this reordering is desirable. On
the other hand, $f_1$ and $f_3$ cannot be further reordered without changing
the result:

$$(2, -3) \to f_2 \to (2, -3) \to f_3 \to (-1, -3) \to f_1 \to (-1, 3)$$
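
The example can be replayed with a few lines of Python. This is a minimal sketch that models each Map UDF as a function returning a list of output records; the helper `run` is a hypothetical stand-in for executing a chain of Map operators.

```python
def f1(r):
    # Replaces B with |B|.
    a, b = r
    return [(a, abs(b))]

def f2(r):
    # Emits records with A >= 0 and filters the rest.
    return [r] if r[0] >= 0 else []

def f3(r):
    # Replaces A with A + B.
    a, b = r
    return [(a + b, b)]

def run(plan, data):
    # Apply a chain of Map operators to a list of records.
    for f in plan:
        data = [o for r in data for o in f(r)]
    return data

I = [(2, -3), (-2, -3)]
out_p = run([f1, f2, f3], I)        # original plan P
out_p_prime = run([f2, f1, f3], I)  # f1 and f2 swapped (plan P')
out_bad = run([f2, f3, f1], I)      # f1 and f3 swapped as well
```

Here `out_p` and `out_p_prime` both evaluate to `[(5, 3)]`, while `out_bad` evaluates to `[(-1, 3)]`.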

We generalize this concept in a safe manner without knowing
the semantics of the operators. Our key insight is that reasoning
about the "conflicts" in the data flow suffices to establish reordering
conditions. For example, we do not need to know whether $f_3$
computes $A + B$ or $A \cdot B$. We only need to know that $f_3$ replaces
the first field of its input record with a new value, which conflicts
with $f_2$ using the first field of its input record to potentially
filter some records. We can therefore establish that these operators
"conflict" on $A$, and cannot be reordered. This holds only if the
execution path of a UDF is uniquely determined by its input data,
i.e., communication between functions except via the explicitly
defined data channels of the data flow program (e.g., shared memory
or other forms of communication) is prohibited. We assume this
restriction throughout this paper.

We define a read set $R_f$ and a write set $W_f$ for each operator
with respect to its UDF $f$. These sets are defined over attributes
that need to be extracted from the plan. In our example
plan, we have two attributes $A, B$ that form the so-called global
record $\mathcal{A} = \{A, B\}$. The read set of an operator contains all
attributes that might influence the operator's output. The write set
of an operator contains all attributes whose values change with an
application of the operator. We formalize these concepts in Section 4.
Two operators "conflict" on an attribute if the attribute is
contained in both operators' write sets, or in one operator's read set
and the other's write set. For example, operator $f_1$ has $R_{f_1} = \{B\}$
and $W_{f_1} = \{B\}$, and operator $f_2$ has $R_{f_2} = \{A\}$ and
$W_{f_2} = \emptyset$.
These operators do not conflict, and can therefore be reordered.
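
The conflict test amounts to three set intersections, as the following Python sketch shows. The function name `conflicts` is hypothetical. The sets for $f_1$ and $f_2$ are the ones stated above; the sets for $f_3$ are an assumption, namely what a conservative analysis of its code would plausibly report (it loads both fields and overwrites the first).

```python
def conflicts(read1, write1, read2, write2):
    """Return the attributes on which two operators conflict.

    An attribute is a conflict if both operators write it, or if one
    operator reads it while the other writes it.
    """
    return (write1 & write2) | (read1 & write2) | (write1 & read2)

# (read set, write set) per UDF; f3's sets are assumed, see lead-in.
f1_sets = ({"B"}, {"B"})
f2_sets = ({"A"}, set())
f3_sets = ({"A", "B"}, {"A"})

c12 = conflicts(*f1_sets, *f2_sets)  # empty: f1 and f2 may be swapped
c23 = conflicts(*f2_sets, *f3_sets)  # f3 writes A, which f2 reads
c13 = conflicts(*f1_sets, *f3_sets)  # f1 writes B, which f3 reads
```

An empty result means the pair may be reordered under this criterion; a non-empty result names the attributes that block the swap.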

The next challenge we address is how to derive read and write 
sets among other necessary properties. In Section 5 we present an 
algorithm that estimates these properties using a static code analy- 
sis (SCA) pass over the code of the first-order functions. Assume 
the code of the three example first-order functions shown below in 
the form of 3-address code [5] where the UDFs access fields A and 
B by their positions (0 and 1 respectively) in the input record: 



    10: f1(InputRecord $ir)
    11: $b := getField($ir, 1)
    12: $or := copy($ir)
    13: if ($b >= 0) goto 16
    14: $b := -$b
    15: setField($or, 1, $b)
    16: emit($or)
    17: return

    20: f2(InputRecord $ir)
    21: $a := getField($ir, 0)
    22: if ($a < 0) goto 25
    23: $or := copy($ir)
    24: emit($or)
    25: return

    30: f3(InputRecord $ir)
    31: $a := getField($ir, 0)
    32: $b := getField($ir, 1)
    33: $sum := $a + $b
    34: $or := copy($ir)
    35: setField($or, 0, $sum)
    36: emit($or)
    37: return


The instructions with labels 10 to 17 are the code of function $f_1$,
with labels 20 to 25 of $f_2$, and with labels 30 to 37 of $f_3$. Consider
for example the code of function $f_2$. Recall that $f_2$ filters records
with negative values for attribute $A$. We can automatically detect
that $A \in R_{f_2}$ by collecting all getField statements (in this case
instruction 21), and determining whether the temporary variables
introduced (in this case $a) are used in the function's code. In our
example, instruction 22 uses the value of $a in a condition, so we
conclude that field 0 of the input record is part of the read set. In
the same way, we can detect that $A \in W_{f_3}$ by looking at
instruction 35, which potentially changes the value of field 0. We can
thus conclude that $f_2$ and $f_3$ conflict on field 0, and cannot be
reordered. This estimation is conservative, but safe. It results in a set
of reorderings that all produce the same query result, but it might
miss valid reorderings. For example, assume that the input data set
$I$ contains only values with $A > 0$. Then, the branch in instruction 22
of function $f_2$ will never be taken, so $f_2$ never filters a record,
and in fact, $f_2$ and $f_3$ can be reordered. However, this is something
that cannot be detected by static code analysis, and this reordering
will be prohibited by our system.
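
The detection idea described above can be sketched as follows. This is a toy illustration and not the algorithm of Section 5: the instruction encoding (tuples tagged `getField`, `setField`, or `use`) and the function `estimate_sets` are assumptions made for this example, and instructions that neither load, store, nor read a temporary (copy, emit, return) are simply omitted.

```python
def estimate_sets(code):
    """Conservatively estimate (read set, write set) as field indices.

    code is a list of tuples:
      ("getField", tmp, field)  tmp := getField($ir, field)
      ("setField", field, tmp)  setField($or, field, tmp)
      ("use", tmp, ...)         any other instruction reading temporaries
    """
    loaded_from = {}   # temporary -> field index it was loaded from
    used = set()       # temporaries read by some instruction
    write = set()
    for op, *args in code:
        if op == "getField":
            tmp, field = args
            loaded_from[tmp] = field
        elif op == "setField":
            field, tmp = args
            write.add(field)
            used.add(tmp)
        elif op == "use":
            used.update(args)
    # A field is read if a temporary loaded from it is used anywhere.
    read = {loaded_from[t] for t in used if t in loaded_from}
    return read, write

# Encodings of the three example UDFs (labels in comments).
F1 = [("getField", "$b", 1),        # 11
      ("use", "$b"),                # 13: if ($b >= 0)
      ("use", "$b"),                # 14: $b := -$b
      ("setField", 1, "$b")]        # 15
F2 = [("getField", "$a", 0),        # 21
      ("use", "$a")]                # 22: if ($a < 0)
F3 = [("getField", "$a", 0),        # 31
      ("getField", "$b", 1),        # 32
      ("use", "$a", "$b"),          # 33: $sum := $a + $b
      ("setField", 0, "$sum")]      # 35
```

The estimate ignores control flow entirely, which is what makes it conservative: a field counts as read or written if any instruction on any path touches it.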

4. CONDITIONS FOR REORDERING 
4.1 Definitions 

The user-code of operators accesses record attributes by static 
field indices. However, the reordering of two operators can cause 
changes of the mapping of field indices to attributes. Since the 
user-code assumes the original mapping, it is essential to avoid that 
attributes are accessed by wrong indices in order to preserve the 
original semantics of the data flow. For this purpose, we define the 
global record as a collection of every attribute that is accessed by 
any operator in the execution plan. Thus, the global record includes 
every attribute of the input data sets as well as the attributes that are 
created by operators at some stage of the execution plan. 

Definition 1. The global record $\mathcal{A}$ is a unique naming of all base
and intermediate attributes in the data flow. In addition, we define
a redirection map $\alpha(D, n)$, which maps every field index $n \in \mathbb{N}$ of
every data set $D$ (base or intermediate) to the corresponding entry
in the global record $\mathcal{A}$.

Next, we formally define the read and write sets. Denote by $\mathbf{D}$
the attributes of data set $D$, and by $\#D$ the number of attributes
of $D$. The write set $W_f$ of an operator with first-order function $f$
contains the attributes whose value might change after applying $f$.

Definition 2. An attribute $A$ belongs to the write set $W_f$ of an
operator with UDF $f$, input $I$, and output $O$ iff:

(1) $A = \alpha(O, m)$, $m > \#I$, or

(2) $A = \alpha(I, n)$, $\exists i \in \mathrm{Instances}(I) : \exists o_i \in f(i) : o_i[n] \neq i[n]$

The definition captures the fact that an attribute is in $W_f$ if it is
either newly created by $f$ (case 1 of the definition), or that there
exists at least one record in the data set with a different value of this
attribute after $f$ is applied (case 2 of the definition). The above
definition can be extended for UDFs that operate on multiple records.
The read set $R_f$ of an operator with user function $f$ contains the
attributes that might influence the operator's output.

Definition 3. An attribute $A = \alpha(I, n)$ belongs to the read set $R_f$
of an operator with UDF $f$, input $I$, and output $O$ iff:

$\exists i_1, i_2 \in \mathrm{Instances}(I), \forall m \neq n : i_1[n] \neq i_2[n] \wedge i_1[m] = i_2[m]$ such that

(1) $|f(i_1)| \neq |f(i_2)|$, or

(2) $\exists o_1 \in f(i_1), o_2 \in f(i_2), k \neq n : o_1[k] \neq o_2[k]$

The definition captures the fact that an attribute $A$ can influence
$f$'s output if a change of only $A$'s value may produce a different
output. Note that key attributes of KAT operators are always
included in the read set because they directly influence the operator's
result. Note that the above definitions do not use the semantics of
the functions. Section 5 discusses how to approximate these sets
using static code analysis of the UDFs.

Finally, we define two conditions that are necessary for reorder- 
ing of operators in most cases: 



Definition 4. Two operators with UDFs $f_1, f_2$ satisfy the
read-only conflict (ROC) condition iff
$R_{f_1} \cap W_{f_2} = W_{f_1} \cap R_{f_2} = W_{f_1} \cap W_{f_2} = \emptyset$.

The ROC condition captures the fact that a UDF does not update 
or use attributes that another UDF updates. The ROC condition 
is necessary for all reorderings described in this paper. To reorder 
KAT operators, we additionally need the condition that key groups 
are preserved: 

Definition 5. An operator with UDF $f$ satisfies the key group
preservation (KGP) condition for an attribute set $K \subseteq \mathcal{A}$ iff
(1) $\forall r \in I : |f(r)| = 1$, or (2) $|f(r)| \leq 1$, and
$\exists F, F \subseteq K : \forall r, r' \in I : \pi_F(r) = \pi_F(r') \Rightarrow |f(r)| = |f(r')|$

The projection $\pi$ of a record on a set of attributes is defined as usual.
The above definition can be extended for KAT operators. The KGP
condition states that function $f$, when applied to a set of records $I_k$
with the same value for $K$, either emits or filters all these records.
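
For intuition, the KGP condition can be tested empirically on a concrete data set, as in the hedged Python sketch below. The function `kgp_holds_on` is hypothetical; it checks the condition only for the records it is given, so a positive answer is evidence rather than a proof. The sketch uses $F = K$, which suffices for the existential in case (2): if cardinalities agree whenever some subset $F \subseteq K$ agrees, they also agree whenever all of $K$ agrees.

```python
from collections import defaultdict

def kgp_holds_on(data, f, key):
    """Check key group preservation of Map UDF f on the given records.

    key extracts the projection of a record on K; f returns a list of
    output records.
    """
    counts = [len(f(r)) for r in data]
    if all(c == 1 for c in counts):
        return True                    # case (1): exactly one output each
    if any(c > 1 for c in counts):
        return False                   # case (2) requires |f(r)| <= 1
    by_key = defaultdict(set)
    for r, c in zip(data, counts):
        by_key[key(r)].add(c)
    # Within each key group, all records are emitted or all are filtered.
    return all(len(cs) == 1 for cs in by_key.values())
```

For example, on records with attributes (A, B) and key A, a filter on "A is odd" passes the check, whereas a filter on "A and B are odd" can split a key group and fails it.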

4.2 Reordering MapReduce Programs 

4.2.1 Reordering Map Operators 

In Section 3 we outlined why two Map operators that satisfy the 
ROC condition can be reordered without changing the query result. 
We now prove this statement formally. 

Theorem 1. Two Map operators can be reordered if their first- 
order functions satisfy the ROC condition. 

PROOF. Assume the two plans

$$P : I \to \mathrm{Map}_f \to S \to \mathrm{Map}_g \to O$$

$$P' : I \to \mathrm{Map}_g \to S' \to \mathrm{Map}_f \to O'$$

We prove that $O \equiv O'$. Assume a record $i \in I$, and let
$O_i = \mathrm{Map}_g(\mathrm{Map}_f([i]))$, $O'_i = \mathrm{Map}_f(\mathrm{Map}_g([i]))$,
$S_i = \mathrm{Map}_f([i]) = f(i)$, and $S'_i = \mathrm{Map}_g([i]) = g(i)$.
It suffices to prove $\forall i \in I : O_i \equiv O'_i$.
We first observe that if the ROC condition holds, the global record
can be partitioned as $\mathcal{A} = \overline{W} \uplus (W_f \uplus W_g)$, where
$A \uplus B$ additionally implies that $A \cap B = \emptyset$. We define
$\pi_F(r)$ as the projection of record $r$ to attribute subset $F$.

First, we prove that an invocation of $f$ and $g$ produces the same
result cardinality in both plans: $|f(i)| = |f(s'_j)| = k$ where
$s'_j \in S'_i$, and $|g(i)| = |g(s_i)| = l$ where $s_i \in S_i$. Records
$s'_j \in S'_i$ are produced by applying $g$ to $i$. Recall that $g$ can
only change $W_g$ attributes, therefore
$\pi_{\overline{W} \cup W_f}(s'_j) = \pi_{\overline{W} \cup W_f}(i)$. Observe that
the execution path of $f$ depends only on the values of attributes
in $\overline{W} \cup W_f$. Therefore, $f$ follows the same execution path for
$s'_j$ and $i$, and the cardinality of its output is the same:
$\forall s'_j : |f(i)| = |f(s'_j)| = k$. We can similarly prove
$|g(i)| = |g(s_i)| = l$. This allows us to decompose plan $P$ for input $i$ as

$$P_1 : i \to f \to [s_i \mid i = 1, \ldots, k]$$
$$P_2 : s_i \to g \to [o_{ij} \mid j = 1, \ldots, l] \quad \forall i = 1, \ldots, k$$

and plan $P'$ as

$$P'_1 : i \to g \to [s'_j \mid j = 1, \ldots, l]$$
$$P'_2 : s'_j \to f \to [o'_{ji} \mid i = 1, \ldots, k] \quad \forall j = 1, \ldots, l$$

We will now prove that $\forall i = 1, \ldots, k, \forall j = 1, \ldots, l : o_{ij} = o'_{ji}$.
We observe that $\pi_{\overline{W}}(o_{ij}) = \pi_{\overline{W}}(o'_{ji})$ since
attributes in $\overline{W}$ are not changed by either $f$ or $g$. Therefore, it
suffices to prove that (1) $\pi_{W_f}(o_{ij}) = \pi_{W_f}(o'_{ji})$, and
(2) $\pi_{W_g}(o_{ij}) = \pi_{W_g}(o'_{ji})$. The proofs
for the two cases are completely symmetric. We proceed to prove
case (1).

From sub-plan $P_2$ we observe that records $o_{ij}$ are produced by
applying $g$ to records $s_i$. Therefore, they have the same values for
all attributes that $g$ does not change:
$\pi_{\overline{W} \cup W_f}(o_{ij}) = \pi_{\overline{W} \cup W_f}(s_i)$.
It suffices thus to prove $\pi_{W_f}(o'_{ji}) = \pi_{W_f}(s_i)$. Consider
sub-plans $P_1$ and $P'_2$ that show the application of $f$ to $i$ and $s'_j$
respectively. First, observe that $s'_j$ comes from applying $g$ to $i$,
therefore $\pi_{\overline{W} \cup W_f}(s'_j) = \pi_{\overline{W} \cup W_f}(i)$. The
execution path of $f$ depends only on values of attributes in
$\overline{W} \cup W_f$. Since
$\pi_{\overline{W} \cup W_f}(s'_j) = \pi_{\overline{W} \cup W_f}(i)$, the
execution of $f$ in sub-plans $P_1$ and $P'_2$ will follow the same execution
path. Therefore, the changes applied to $i$ will be the same as the
changes applied to $s'_j$. Therefore, $\pi_{W_f}(o'_{ji}) = \pi_{W_f}(s_i)$. $\Box$

4.2.2 Reordering Map and Reduce Operators 

Recall that unlike the MapReduce model, the PACT model al- 
lows arbitrary data flows containing Map and Reduce (among other) 
operators. Assume the plan 

$$P : I \to \mathrm{Map}_f \to S \to \mathrm{Reduce}_g \to O$$

with input $I$ having two attributes $(A, B)$. UDF $f$ emits all input
records with odd values both of $A$ and $B$. UDF $g$ calculates the
sum of $B$ using $A$ as key, and appends the sum as a new attribute
$C$ to all of its input records. Note that the ROC condition holds.
Consider the input data set in the following example application of
the plan:



$$[(1,1), (1,2), (2,1), (2,2)] \to \mathrm{Map}_f \to [(1,1)] \to \mathrm{Reduce}_g \to [(1,1,1)]$$

and the execution if the operators are reordered

$$[(1,1), (1,2), (2,1), (2,2)] \to \mathrm{Reduce}_g \to [(1,1,3), (1,2,3), (2,1,3), (2,2,3)] \to \mathrm{Map}_f \to [(1,1,3)]$$


The ROC condition alone cannot guarantee the reordering of a Map 
and a Reduce operator. The reason is that the key groups of the 
Reduce operator in the two plans do not have the same cardinality, 
and thus result in a different value for attribute C. This would 
not be a problem if the Map operator either eliminated whole key 
groups, or left them intact. Note that if Map also emitted multiple 
records per call, the cardinality of the key groups would change. 
Therefore, we need the KGP condition to hold as well. 
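
The counterexample can be reproduced with a short Python sketch. The helpers `map_op` and `reduce_op` are hypothetical sequential stand-ins for the Map and Reduce second-order functions, with the Reduce key given as a field index.

```python
from collections import defaultdict

def f(r):
    # Map UDF: keep records whose A and B are both odd.
    return [r] if r[0] % 2 == 1 and r[1] % 2 == 1 else []

def g(group):
    # Reduce UDF: append the sum of B over the key group as attribute C.
    total = sum(r[1] for r in group)
    return [r + (total,) for r in group]

def map_op(data, udf):
    return [o for r in data for o in udf(r)]

def reduce_op(data, udf, key_index):
    groups = defaultdict(list)
    for r in data:
        groups[r[key_index]].append(r)
    return [o for grp in groups.values() for o in udf(grp)]

I = [(1, 1), (1, 2), (2, 1), (2, 2)]
plan_p = reduce_op(map_op(I, f), g, 0)          # Map before Reduce
plan_reordered = map_op(reduce_op(I, g, 0), f)  # Reduce before Map
```

The two plans yield `[(1, 1, 1)]` and `[(1, 1, 3)]` respectively, because the Map UDF removes only part of the key group with $A = 1$ before the sum is computed.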

Theorem 2. A Map operator with UDF $f$ and a Reduce operator
with UDF $g$ can be reordered if the ROC condition holds for $f, g$,
and the KGP condition holds for $f$ and the key $K$ of the Reduce
operator.



PROOF. Consider the two pipelines:

$$P : I \to \mathrm{Map}_f \to S \to \mathrm{Reduce}_g \to O$$
$$P' : I \to \mathrm{Reduce}_g \to S' \to \mathrm{Map}_f \to O'$$

As before, we prove that $O \equiv O'$. Let $I = \bigcup_k I_k$, where $I_k$
is the key group with key value $k$ and the plans

$$P : I_k \to \mathrm{Map}_f \to S_k \to \mathrm{Reduce}_g \to O_k$$
$$P' : I_k \to \mathrm{Reduce}_g \to S'_k \to \mathrm{Map}_f \to O'_k$$

It suffices to prove that $O_k \equiv O'_k$. Observe that if the KGP
condition holds, $|S_k| = |I_k|$, or $|S_k| = 0$. If $|S_k| = 0$, then
$\mathrm{Map}_f$ will also filter all records from $S'_k$ in $P'$, and trivially
$O_k \equiv O'_k \equiv \bot$. Assume that $|I_k| = |S_k| = k$, and
$|O_k| = l$. Since the Reduce UDF treats $I_k$ in $P'$ in the same way as
$S_k$ in $P$ (because $|I_k| = |S_k|$ and the ROC condition holds), and the
Map UDF emits exactly one record per input, it holds that
$|S'_k| = |O'_k| = l$. Therefore, we can decompose plan $P$ as

$$P_1 : \forall i \in [i_1, \ldots, i_k],\ i \to f \to s,\ s \in [s_1, \ldots, s_k]$$
$$P_2 : [s_1, \ldots, s_k] \to g \to [o_1, \ldots, o_l]$$

and plan $P'$ as

$$P'_1 : [i_1, \ldots, i_k] \to g \to [s'_1, \ldots, s'_l]$$
$$P'_2 : \forall s' \in [s'_1, \ldots, s'_l],\ s' \to f \to o',\ o' \in [o'_1, \ldots, o'_l]$$

We now prove that $\forall j = 1, \ldots, l : o_j = o'_j$. Due to the ROC
condition it suffices to prove (1) $\pi_{W_f}(o_j) = \pi_{W_f}(o'_j)$, and
(2) $\pi_{W_g}(o_j) = \pi_{W_g}(o'_j)$.

We proceed to prove case (1). Case (2) is proven similarly.
From sub-plan $P_2$, and record $o_j$, there is a record $s_x$ with the
same attribute values for $W_f$:
$\forall j, j = 1, \ldots, l\ \exists x, x = 1, \ldots, k : \pi_{W_f}(o_j) = \pi_{W_f}(s_x)$.
Note that Reduce may "consolidate" multiple records into one, or produce
multiple records per input record. However, due to the ROC condition,
attributes in $W_f$ must be preserved. Similarly, from $P'_1$ we have
$\forall j, j = 1, \ldots, l\ \exists y, y = 1, \ldots, k : \pi_{R_f}(s'_j) = \pi_{R_f}(i_y)$
(due to the ROC condition, attributes in $R_f$ are preserved as well).
Using the same arguments as in the proof of Theorem 1, we know that $g$
follows the same execution path in sub-plans $P_2$ and $P'_1$.

Therefore, $f$ follows the same execution path for records $s'_j$ and
$i_x$, so the result records of applying $f$ to these records will also
share the same values for $W_f$ attributes:
$\pi_{W_f}(o'_j) = \pi_{W_f}(s_x) \Rightarrow \pi_{W_f}(o'_j) = \pi_{W_f}(o_j)$. $\Box$

The conditions for reordering two Reduce operators are the ROC
condition and the KGP condition for both UDF-key pairs. The
proof proceeds similarly.

4.3 Reordering Binary Second-Order Functions

4.3.1 Record-at-a-time Operators 

We first cover plans with RAT operators that are constructed using
the Cross, Match, and Map PACTs. Assume a Cross operator
with UDF $f$ and inputs $R, S$. The operator applies $f$ to every pair
$(r, s) \in R \times S$. Here, the Cartesian product $R \times S$ of two data
sets $R = [r_1, \ldots, r_n]$, $S = [s_1, \ldots, s_m]$ is defined as a data set
$R \times S = [r_i | s_j : i = 1, \ldots, n, j = 1, \ldots, m]$ where $r|s$ is the
concatenation of records $r$ and $s$. The attribute set of $R \times S$ is the
union of the attribute sets of $R$ and $S$ with a proper renaming (e.g.,
each attribute is prefixed by the data set name it belongs to).

We observe that we can conceptually transform a Cross operator
to a Map operator with the same UDF over the Cartesian product:

$$\mathrm{Cross}_f(R, S) \equiv \mathrm{Map}_f(R \times S)$$

We can similarly transform a Match operator with UDF $f$ to a Map
operator with UDF $f'$ over the Cartesian product:

$$\mathrm{Match}_f(R, S) \equiv \mathrm{Map}_{f'}(R \times S)$$

The difference here is that we need to change the UDF $f$ in order to
incorporate the implicit equi-join performed by the Match
second-order function. Assume that the join keys are attributes $R.A$, $S.B$.
We substitute $f$ with

$$f'(r|s) = \text{if } (R.A = S.B) \text{ then } f(r, s) \text{ else } \bot$$

We stress that this is a conceptual transformation in order to
establish reordering conditions; all optimizations described in this paper
are non-intrusive. This transformation simply means that the
attributes used as keys for the Match operator are added to the read
set $R_f$ of the operator.
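
The conceptual transformation of Match into a Map over the Cartesian product can be sketched in Python as follows. All names (`match`, `cartesian`, `lift`) are hypothetical, records are tuples, keys are given as field positions, and the arity of the left input is passed explicitly so that a concatenated record can be split again.

```python
from itertools import product

def match(R, S, key_r, key_s, f):
    # Direct equi-join semantics of Match.
    return [o for r in R for s in S
            if r[key_r] == s[key_s] for o in f(r, s)]

def cartesian(R, S):
    # Concatenate every pair of records: r|s.
    return [r + s for r, s in product(R, S)]

def lift(f, arity_r, key_r, key_s):
    """Build f' over concatenated records r|s.

    f' performs the key comparison itself, which is why the key
    attributes become part of its read set.
    """
    def f_prime(rs):
        r, s = rs[:arity_r], rs[arity_r:]
        return f(r, s) if r[key_r] == s[key_s] else []
    return f_prime

R = [(1, 'a'), (2, 'b')]
S = [('x', 1), ('y', 3)]
concat = lambda r, s: [r + s]

direct = match(R, S, 0, 1, concat)
f_prime = lift(concat, arity_r=2, key_r=0, key_s=1)
via_map = [o for rs in cartesian(R, S) for o in f_prime(rs)]
```

Both `direct` and `via_map` contain the single joined record `(1, 'a', 'x', 1)`. The product is never meant to be materialized in practice; it only serves to reason about reordering.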

Using the above transformations, plans that contain Match, Cross, 
and Map operators are equivalent to plans that contain only Map 
operators and Cartesian products. Therefore, it only remains to es- 
tablish when the latter two can be reordered: 

Theorem 3. A Map operator with UDF $f$ and a Cartesian product
operator $R \times S$ can be reordered as

$$\mathrm{Map}_f(R \times S) \equiv \mathrm{Map}_f(R) \times S$$

iff $(R_f \cup W_f) \cap \mathbf{S} = \emptyset$, where $\mathbf{S}$ is the attribute
set of $S$. The case of pushing the operator to the other side of the
Cartesian product is symmetric.

The proof follows directly from the fact that
$(R_f \cup W_f) \cap \mathbf{S} = \emptyset \Rightarrow f(r|s) = f(r)|s$.
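
A small Python sketch illustrates the theorem under stated assumptions: records are dictionaries keyed by globally unique attribute names (mimicking the global record), and the UDF `f`, the helpers `cross` and `map_op`, and the predicate `can_push_left` are all hypothetical.

```python
def can_push_left(read_f, write_f, attrs_s):
    # Condition of the theorem: f neither reads nor writes attributes of S.
    return not ((read_f | write_f) & attrs_s)

def f(rec):
    # Reads and writes only attribute "R.x"; filters non-positive values.
    if rec["R.x"] <= 0:
        return []
    out = dict(rec)
    out["R.x"] = rec["R.x"] * 10
    return [out]

def cross(R, S):
    # Cartesian product; attribute names are already disjoint.
    return [{**r, **s} for r in R for s in S]

def map_op(data, udf):
    return [o for r in data for o in udf(r)]

R = [{"R.x": 1}, {"R.x": -1}]
S = [{"S.y": 5}, {"S.y": 6}]

pushable = can_push_left({"R.x"}, {"R.x"}, {"S.y"})
lhs = map_op(cross(R, S), f)   # Map over the product
rhs = cross(map_op(R, f), S)   # Map pushed below the product
```

Because `f` touches only `R.x`, both sides produce the same records, and the pushed-down variant invokes `f` once per record of `R` instead of once per pair.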

It is straightforward to construct the conditions that allow
reordering for Match, Cross, and Map operators using Theorems 1
and 3. We now show the proof for reordering two Match operators
with first-order functions $f$, $g$, and key attributes $K_f$, $K_g$ as a series
of transformations:




[Figure: series of plan transformations (a) to (f) for reordering two Match operators over the inputs R, S, and T.]


X 









Step (a) → (b) substitutes the Match operators with their Map and
Cartesian product equivalents. Step (b) → (c) reorders $\mathrm{Map}_{f'}$ with
the Cartesian product with $T$. For plans (b) and (c) to be equivalent
it is necessary that $f'$ does not use attributes of $T$
($(R_{f'} \cup W_{f'}) \cap T = \emptyset$). Step (c) → (d) makes use of the conditions of Theorem 1
(namely the ROC condition on UDFs $f'$, $g'$) to reorder the two Map
operators, and reorders the two Cartesian products using the normal
associativity rule. Step (d) → (e) pushes $\mathrm{Map}_{g'}$ under the Cartesian
product, requiring the condition $(R_{g'} \cup W_{g'}) \cap R = \emptyset$. Finally,
step (e) → (f) reconstructs the Match operators. By collecting the
conditions needed by the series of transformations, we arrive at the
conditions to reorder two Match operators.
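Written as expressions, the six plans can be read as follows. This is a reconstruction from the step descriptions, under the assumption that the starting plan joins $R$ with $S$ first and then joins the result with $T$; the labels match the steps in the text.

```latex
\begin{align*}
\text{(a)}\quad & \mathrm{Match}_g(\mathrm{Match}_f(R, S),\, T) \\
\text{(b)}\quad & \mathrm{Map}_{g'}(\mathrm{Map}_{f'}(R \times S) \times T) \\
\text{(c)}\quad & \mathrm{Map}_{g'}(\mathrm{Map}_{f'}((R \times S) \times T))
   && \text{needs } (R_{f'} \cup W_{f'}) \cap T = \emptyset \\
\text{(d)}\quad & \mathrm{Map}_{f'}(\mathrm{Map}_{g'}(R \times (S \times T)))
   && \text{needs ROC for } f', g' \\
\text{(e)}\quad & \mathrm{Map}_{f'}(R \times \mathrm{Map}_{g'}(S \times T))
   && \text{needs } (R_{g'} \cup W_{g'}) \cap R = \emptyset \\
\text{(f)}\quad & \mathrm{Match}_f(R,\, \mathrm{Match}_g(S, T))
\end{align*}
```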

Lemma 1. Two Match operators with UDFs $f$, $g$ and key sets
$K_f \subseteq R \cup S$, $K_g \subseteq S \cup T$ can be reordered iff the ROC
condition holds for $f'$, $g'$, $(R_{f'} \cup W_f) \cap T = \emptyset$, and $(R_{g'} \cup W_g) \cap R = \emptyset$,
where $R_{f'} = R_f \cup K_f$, and $R_{g'} = R_g \cup K_g$.

By repeating the same process for each pair of Match, Cross, and 
Map, we establish similar conditions for all combinations of these 
operators. 
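The conditions of Lemma 1 reduce to a few set operations once the read sets, write sets, and keys are known. The sketch below assumes that the ROC condition of Theorem 1 (not repeated in this section) means that two UDFs conflict at most on reads, i.e., neither writes an attribute the other reads or writes. The function names and attribute names are hypothetical.

```python
def roc(read_f, write_f, read_g, write_g):
    """Assumed ROC test: the two UDFs may overlap only in what they read."""
    return not (read_f & write_g or write_f & read_g or write_f & write_g)

def matches_reorderable(read_f, write_f, keys_f,
                        read_g, write_g, keys_g,
                        attrs_R, attrs_T):
    """Check the conditions of Lemma 1 for Match_f over (R, S) and Match_g over (S, T)."""
    read_fp = read_f | keys_f   # read set of f': f's reads plus its join keys
    read_gp = read_g | keys_g   # read set of g': g's reads plus its join keys
    return (roc(read_fp, write_f, read_gp, write_g)
            and not ((read_fp | write_f) & attrs_T)
            and not ((read_gp | write_g) & attrs_R))

attrs_R = {"R.id", "R.x"}
attrs_T = {"T.id", "T.z"}

ok = matches_reorderable(
    read_f={"R.x", "S.y"}, write_f={"f.out"}, keys_f={"R.id", "S.rid"},
    read_g={"T.z"}, write_g={"g.out"}, keys_g={"S.tid", "T.id"},
    attrs_R=attrs_R, attrs_T=attrs_T)

# If g additionally reads an attribute of R, it can no longer be pushed past R.
blocked = matches_reorderable(
    read_f={"R.x", "S.y"}, write_f={"f.out"}, keys_f={"R.id", "S.rid"},
    read_g={"T.z", "R.x"}, write_g={"g.out"}, keys_g={"S.tid", "T.id"},
    attrs_R=attrs_R, attrs_T=attrs_T)
```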

4.3.2 Key-at-a-time Operators 

Incorporating KAT operators (Reduce and CoGroup) requires 
stricter conditions, since groups must be preserved. We first show 
how to reorder a Reduce operator with a Cartesian product. 

Theorem 4. A Reduce operator with UDF $g$ and key $K \cup R$ and a
Cartesian product operator $R \times S$ can be reordered as

$\mathrm{Reduce}_{g, K \cup R}(R \times S) \equiv R \times \mathrm{Reduce}_{g, K}(S)$

iff $(R_g \cup W_g) \cap R = \emptyset$.



PROOF. Assume the data sets $R = [r_i : i = 1, \ldots, n]$, $S =
[s_i : i = 1, \ldots, m]$. The key of the Reduce operator $K \cup R$ includes
all attributes of data set $R$. Note that $K \subseteq S$. Every record of the
Cartesian product can be written as $r_i | k_j | s'_k$, where $k_j$ is the part of
the $S$ record with attributes $K$, and $s'_k$ is the part of an $S$ record with
non-key attributes. Every record $r_i | k_j | s'_k$ of the Cartesian product
belongs to a Reduce group $G_{ij}$, determined by $r_i$ and
$k_j$ only. The output of the plan is $[g(G_{ij}), G_{ij} = \{r_i | k_j | s'_k\}]$.
Assume that $g$ does not use any attribute of $R$ for any purpose
other than grouping its input data set. Then, it is safe to "push"
$\mathrm{Reduce}_g$ to the data set $S$ and remove the $R$ part of the Reduce
key. This will produce groups $G_j = \{k_j | s'_k\}$, and the output of
the reduce will be $[g(G_j)]$. By performing the Cartesian product
of these groups with $R$, we get the set of records $r_i | g(G_j)$. If the
Reduce UDF $g$ simply emits the $R$ attributes unchanged, we have
$r_i | g(k_j | s'_k) = g(r_i | k_j | s'_k)$. □
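The argument of the proof can be checked on a toy instance. The sketch below is illustrative only: the data, the aggregating UDF `g`, and the helpers `reduce_op` and `cross` are assumptions made for this example. Here `g` reads and writes only `S.v`, so $(R_g \cup W_g) \cap R = \emptyset$, and it emits all remaining attributes unchanged.

```python
from collections import defaultdict

def cross(R, S):
    """Cartesian product: concatenate each record of R with each record of S."""
    return [{**r, **s} for r in R for s in S]

def reduce_op(g, D, key):
    """Reduce: group records by the key attributes and call g once per group."""
    groups = defaultdict(list)
    for rec in D:
        groups[tuple(rec[k] for k in key)].append(rec)
    out = []
    for group in groups.values():
        out.extend(g(group))
    return out

def g(group):
    """Hypothetical UDF: sum S.v over the group, copy all other attributes."""
    out = dict(group[0])
    out["S.v"] = sum(rec["S.v"] for rec in group)
    return [out]

R = [{"R.a": 1}, {"R.a": 2}]
S = [{"S.k": "x", "S.v": 1}, {"S.k": "x", "S.v": 2}, {"S.k": "y", "S.v": 5}]

# Reduce above the Cartesian product, keyed on K plus all attributes of R.
lhs = reduce_op(g, cross(R, S), key=["R.a", "S.k"])
# Reduce pushed to S with the R part of the key removed.
rhs = cross(R, reduce_op(g, S, key=["S.k"]))
```

The pushed-down plan groups three records instead of six, which is where the benefit of the rewrite comes from.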

Using the above transformation, we can, in principle, reorder Re- 
duce with Match and Cross operators by transforming the latter to 
Map operators over Cartesian products. It is not very often that 
the Reduce key includes all attributes of a data set. However, we 
can consider special cases where it is safe to add the R attributes 
to the Reduce key without changing the result. One case is when 
$|R| = 1$. This appears quite often in practice when implementing
SQL queries with correlated subqueries that return a single tuple.
More interestingly, using Theorem 4 as a basis, we can arrive
at a Match-Reduce transformation similar to the invariant grouping 
transformation in relational DBMSs [13]. Assume the plan 

(a) $\mathrm{Reduce}_{g, F}(\mathrm{Match}_{f, R.K = S.F}(R, S))$

where the Match keys are $K \subseteq R$, $F \subseteq S$, and the Reduce key
is a superset of $F$. Assume that $F$ is a foreign key to $K$. Then, in
every record received by the Reduce operator, the F part uniquely 
determines all R attributes. We can therefore add R to the key of the 
Reduce operator without changing the Reduce groups, and apply 
Theorem 4 to push the Reduce under the Match. As always, the 
ROC and KGP conditions must hold in order to reorder the Reduce 
and Map UDFs. The transformation steps taking plan (a) above as 
the starting point are shown below. 



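[Figure: transformation steps (b) to (e). Plan (b) applies $\mathrm{Reduce}_{g, F \cup R}$ on top of $\mathrm{Map}_{f'}$ over $R \times S$; plan (c) places $\mathrm{Map}_{f'}$ on top of $\mathrm{Reduce}_{g, F \cup R}$; plan (d) has $\mathrm{Map}_{f'}$ over a Cartesian product; plan (e) has $\mathrm{Match}_{f, R.K = S.F}$ at the root.]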




The last step is to incorporate CoGroup operators. We note that
a CoGroup operator can be conceptually transformed to a Reduce
operator over the tagged union $R \cup_t S$ of its inputs $R$, $S$:

$\mathrm{CoGroup}_g(R, S) \equiv \mathrm{Reduce}_{g'}(R \cup_t S)$

The tagged union of two data sets $R$ and $S$ is simply the data set $R$
followed by the data set $S$, where each record has an additional
lineage attribute $l$, which tracks the data set that the record originates
from. The CoGroup UDF $g$ is properly annotated to distinguish
between data sets based on the lineage attribute, yielding the Reduce
UDF $g'$.

Map and Reduce operators can be pushed down under the tagged
union $R \cup_t S$ if their UDFs operate only on one of the tagged
union's inputs. This can be properly detected using the lineage
attribute $l$. For example, assume that we want to push a Map operator
with UDF $f$ under the tagged union $R \cup_t S$, and that $f$ uses only
$R$ attributes. We can define a UDF $f_R$ as

$f_R(r) = \begin{cases} f(r) & \text{if } r.l = R \\ r & \text{otherwise,} \end{cases}$

thus forcing the Map UDF $f$ to ignore $S$ records. This transformation
yields

$\mathrm{Map}_{f_R}(R \cup_t S) \equiv \mathrm{Map}_f(R) \cup_t S$

and allows the following series of transformations that show how a
Map operator can be reordered with a CoGroup operator.



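[Figure: plans (a) to (e) for reordering a Map with a CoGroup operator. (a) $\mathrm{Map}_f$ over $\mathrm{CoGroup}_g(R, S)$; (b) $\mathrm{Map}_f$ over $\mathrm{Reduce}_{g'}(R \cup_t S)$; (c) $\mathrm{Reduce}_{g'}$ over $\mathrm{Map}_{f_R}(R \cup_t S)$; (d) $\mathrm{Reduce}_{g'}$ over the tagged union with the Map pushed onto $R$; (e) $\mathrm{CoGroup}_g$ with the Map applied to $R$.]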

Step (a) → (b) replaces CoGroup with its Reduce equivalent. Step
(b) → (c) uses the conditions of Theorem 2 to reorder the Map
and Reduce operators and transforms Map's UDF $f$ to $f_R$. Step
(c) → (d) pushes the Map operator under the tagged union by
reversing the previous transformation. Finally, step (d) → (e)
reconstructs the CoGroup operator and transforms $f_R$ back to $f$. We
follow the same procedure to establish reordering conditions between
CoGroup and other operators.
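The two building blocks of this argument, the tagged union and the restricted UDF $f_R$, can be sketched as follows. This is an illustration under simplifying assumptions: records are dictionaries, the lineage attribute is stored under the key `"l"`, both inputs share a single key attribute, and all helper names are hypothetical.

```python
from collections import defaultdict

def tagged_union(R, S):
    """R followed by S, each record extended by the lineage attribute l."""
    return [{**r, "l": "R"} for r in R] + [{**s, "l": "S"} for s in S]

def strip(rec):
    """Drop the lineage attribute again."""
    return {k: v for k, v in rec.items() if k != "l"}

def cogroup(g, R, S, key):
    """CoGroup: call g once per key value with the matching records of both inputs."""
    keys = list(dict.fromkeys([r[key] for r in R] + [s[key] for s in S]))
    return [g([r for r in R if r[key] == k], [s for s in S if s[key] == k])
            for k in keys]

def reduce_op(g, D, key):
    """Reduce: group by key and call g once per group."""
    groups = defaultdict(list)
    for rec in D:
        groups[rec[key]].append(rec)
    return [g(group) for group in groups.values()]

def to_reduce_udf(g):
    """Build g' from g: split a group by lineage, then call the CoGroup UDF."""
    def g_prime(group):
        return g([strip(r) for r in group if r["l"] == "R"],
                 [strip(r) for r in group if r["l"] == "S"])
    return g_prime

def restrict_to_R(f):
    """Build f_R from f: apply f to R records, pass S records through."""
    def f_R(rec):
        if rec["l"] == "R":
            return {**f(strip(rec)), "l": "R"}
        return rec
    return f_R

R = [{"k": 1}, {"k": 2}]
S = [{"k": 1}, {"k": 1}, {"k": 3}]
g = lambda rs, ss: (len(rs), len(ss))
f = lambda r: {**r, "x": r["k"] * 10}

via_cogroup = cogroup(g, R, S, "k")
via_reduce = reduce_op(to_reduce_udf(g), tagged_union(R, S), "k")

map_above = [restrict_to_R(f)(rec) for rec in tagged_union(R, S)]
map_below = tagged_union([f(r) for r in R], S)
```

`via_cogroup` and `via_reduce` agree, as do `map_above` and `map_below`, which mirrors the two equivalences used in steps (a) → (b) and (c) → (d).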

4.4 Possible Optimizations 

We have presented the necessary and sufficient conditions to re- 
order every combination of PACT operators. These conditions are 
usually the ROC and the KGP conditions, together with some re- 
strictions on the key of the Reduce operator. 

These conditions lead to a number of possible optimizations. 
First, assuming a straightforward implementation of an acyclic SQL 
query as a PACT program, our conditions allow the full set of join 
and selection reorderings that RDBMS optimizers consider. Sec- 
ond, we allow the invariant grouping transformation [13], the most 
elementary form of aggregation push-down. More advanced
transformations that include group-by considered by RDBMSs assume
knowledge of the nature of the aggregating function, and are thus
of limited applicability in settings with arbitrary UDFs such as ours [14].

We do not allow reorderings that need semantic information to 
be established, including associative side-effects. For example, we 
cannot reorder two Map functions that add a constant number to 
the same field. In addition, the fact that we discover the necessary
conditions for reordering through static code analysis poses further
restrictions on the possible optimizations (see Section 5 for details).

5. DISCOVERING PROPERTIES VIA 
CODE ANALYSIS 

The reordering proofs presented in Section 4 assume knowledge 
of a global record, read and write sets for each operator, as well as 
bounds on the output cardinality. In this section, we briefly sketch 
our solution to estimating read and write sets, as well as creating a 
global record, via static code analysis. We omit the details for emit 
cardinalities, which can be estimated by traversing the control flow 
graph of a UDF. 

Our solution relies on a static code analysis (SCA) framework
that analyzes the Java bytecode of a UDF. We assume that the
framework provides a control flow graph and two data structures
that are obtained by a data flow analysis: A use-definition chain
Use-Def(l, $t) of a statement l and variable $t is a list of all
possible definitions of variable $t that reach l without being
overridden by other definitions. A definition-use chain Def-Use(l, $t)
is a list of all uses of variable $t defined in statement l.

For the remainder of this section, we assume that the UDF code
is formatted as typed three-address code [5]. The possible
statements in three-address code are definitions of a local (e.g., int i)
or a temporary (e.g., int $t) variable, assignment (e.g., $t := 3),
branching (e.g., if ($t < 3) goto label), as well as basic
arithmetic and function calls. In addition, we assume the existence of an
attribute type, Attribute, as well as record types InputRecord
and OutputRecord, and a set of functions that operate on these
types. These functions essentially constitute the assumed record
API, which is exposed to the programmer of PACT programs, and
they are gradually introduced in the course of this section.

We estimate the read set R_f of an operator by scanning its UDF's
code for statements of the form l: $t := getField($ir, n).
Statement l stores the n-th field of the (parameter variable) input record
$ir to the temporary variable $t. We assume that this is the only
record API function to access a particular field of an input record.
We further assume that integer n is statically computable. Recall
that the n-th field of the input I corresponds to attribute a(I, n) of
the global record. We then look up all uses of the temporary
variable $t in the code using the data structure Def-Use(l, $t). If such
uses exist, then we add the attribute a(I, n) to R_f.
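The read set estimation can be sketched on a toy representation of three-address code. This is not the Soot-based implementation: the `Stmt` class, the operation names, and the simplified `def_use` lookup are assumptions for illustration, and the sketch only handles straight-line code in which every temporary is assigned once. Field indexes stand in for the global-record attributes a(I, n).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Stmt:
    label: int
    op: str                       # e.g. "getField", "add", "setField", "emit"
    target: Optional[str] = None  # variable defined by the statement, if any
    args: tuple = ()

def def_use(code, stmt):
    """Toy Def-Use chain: all other statements that use the variable defined by stmt."""
    return [s for s in code if s.label != stmt.label and stmt.target in s.args]

def estimate_read_set(code, input_var="$ir"):
    """Collect field indexes n of getField($ir, n) statements whose result is used."""
    read = set()
    for s in code:
        if s.op == "getField" and s.args[0] == input_var:
            if def_use(code, s):
                read.add(s.args[1])
    return read

code = [
    Stmt(1, "getField", "$t1", ("$ir", 0)),   # 1: $t1 := getField($ir, 0)
    Stmt(2, "getField", "$t2", ("$ir", 2)),   # 2: $t2 := getField($ir, 2), never used
    Stmt(3, "add", "$t3", ("$t1", 1)),        # 3: $t3 := $t1 + 1
    Stmt(4, "copy_ctor", "$or", ("$ir",)),    # 4: $or := new OutputRecord($ir)
    Stmt(5, "setField", None, ("$or", 3, "$t3")),
    Stmt(6, "emit", None, ("$or",)),
]

read_set = estimate_read_set(code)
```

Field 2 is fetched but never used, so only field 0 ends up in the estimated read set.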

Estimating the write set W_f of an operator is more challenging
than read set estimation since implicit modifications must also
be taken into account. Our record API provides two constructors
to create an output record $or. First, a copy constructor
$or = new OutputRecord($ir) to copy an input record $ir. Second, the
default constructor $or = new OutputRecord() to create a new and
empty output record $or. The subtle difference is that the first
constructor implicitly copies all attributes of the input record (Implicit
Copy) while the second constructor implicitly projects all attributes
(Implicit Projection). In addition, the API provides methods to
explicitly copy, project, modify, and add single attributes to output
records. Therefore, the code analysis method to estimate write sets
must identify whether a user function implicitly copies or projects,
and estimate a complementary set of explicitly projected or copied
attributes. In addition, a set of modified and added attributes must
be derived.

In order to identify the implicit operation and the attribute sets
required for the write set estimation, we start by collecting all
statements of the form e: emit($or) which emit the output record
$or. We track the origin of $or and can safely identify the
implicit operation by identifying the constructor call. If both
constructors are used in different code paths, implicit projection is the
safe choice. Subsequently, the remaining attribute sets are
estimated by collecting all statements s: setField($or, n, $t) that
set the n-th field of output record $or to the value of the
temporary variable $t. Explicit projections can be identified if $t
is null. Explicit copies require that $t was previously set by
l: $t := getField($ir, n). This can be easily detected by looking
at Use-Def(s, $t). In all other cases, statement s defines an explicit
modification operation and is added to the appropriate set. Note
that it is always safe to consider s as an explicit modification. Our
implementation includes an additional record constructor
$o = new OutputRecord($i1, $i2) that concatenates two input records in
order to support efficient binary UDFs. This constructor yields
implicit copy operations for both input records. By looking at all
statements s, we can also keep track of the global record. A new
attribute a(O, n) is added to the global record if integer n is larger
than #I, the number of attributes in the input I.
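The classification described above can be sketched on the same kind of toy three-address code. Again this is an assumption-laden illustration, not the actual analysis: the `Stmt` class and operation names are hypothetical, the code is assumed to be straight-line, and the sketch only classifies statements; it does not derive the final write set.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Stmt:
    label: int
    op: str                       # "getField", "copy_ctor", "default_ctor", "setField", "emit", ...
    target: Optional[str] = None  # variable defined by the statement, if any
    args: tuple = ()

def use_def(code, var):
    """Toy Use-Def chain: all statements that define the given variable."""
    return [s for s in code if s.target == var]

def analyze_writes(code, input_var="$ir"):
    """Return the implicit operation and the explicitly copied, projected, modified fields."""
    ctors = set()
    for e in (s for s in code if s.op == "emit"):
        for d in use_def(code, e.args[0]):
            if d.op in ("copy_ctor", "default_ctor"):
                ctors.add(d.op)
    # Implicit projection is the safe choice unless only the copy constructor is used.
    implicit = "copy" if ctors == {"copy_ctor"} else "projection"

    copied, projected, modified = set(), set(), set()
    for s in (s for s in code if s.op == "setField"):
        _, n, value = s.args
        defs = use_def(code, value) if isinstance(value, str) else []
        if value is None:                       # setField($or, n, null)
            projected.add(n)
        elif (len(defs) == 1 and defs[0].op == "getField"
              and defs[0].args == (input_var, n)):
            copied.add(n)                       # value comes unchanged from field n
        else:
            modified.add(n)                     # always safe to assume a modification
    return implicit, copied, projected, modified

code = [
    Stmt(1, "getField", "$t0", ("$ir", 0)),
    Stmt(2, "getField", "$t1", ("$ir", 1)),
    Stmt(3, "add", "$t2", ("$t1", 1)),
    Stmt(4, "default_ctor", "$or", ()),          # $or := new OutputRecord()
    Stmt(5, "setField", None, ("$or", 0, "$t0")),
    Stmt(6, "setField", None, ("$or", 1, "$t2")),
    Stmt(7, "setField", None, ("$or", 2, None)),
    Stmt(8, "emit", None, ("$or",)),
]

implicit, copied, projected, modified = analyze_writes(code)
```

In this example the UDF implicitly projects, explicitly copies field 0, explicitly projects field 2, and modifies field 1.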

The most important property of any method that relies on static 
code analysis is to guarantee safety. In our setting, safety is de- 
fined as follows: Our analysis algorithm creates a set of properties, 
which in turn lead to a certain set of possible reorderings. These 
reorderings result in a set of plans P' equivalent to the initial plan
P. Our method is safe if every plan in P' produces the same query
result as P for every possible input I.

We guarantee safety through conservatism. In particular, we 
guarantee that the properties discovered by our static code anal- 
ysis algorithm are supersets of the true properties of any execution 
of the program for any collection of input data sets. We omit the 
proofs due to lack of space. The main intuition is that we consider 
all possible execution paths of operators, and we add an attribute 
to the global record, and read and write set of an operator when 
in doubt. Since the discovered properties are supersets of the real 
properties, they cause additional conflicts (see Section 4) leading 
to a subset of the valid reorderings, and thus to a subset of the true 
equivalent alternative plans. 

6. PLAN ENUMERATION 

In this section, we present an algorithm that, for a given data 
flow, enumerates all data flows that can be derived by valid pair- 
wise reorderings of operators. The algorithm differs significantly 
from the well-known enumeration algorithms used in traditional 
relational database optimizers, namely enumeration via top-down 
branch-and-bound [19,21] or bottom-up dynamic programming [27, 
29]. This is due to the difference in the algorithm input. Tra- 
ditional relational optimizers operate on algebraic expressions on 
which heuristics such as selection and projection push-down can 
be applied and from which data structures such as join graphs can 
be derived. In contrast, our enumeration algorithm is called with 
a specific data flow instance from which all valid reordered data 
flows must be generated. In the presented version, the algorithm is 
restricted to tree-shaped data flows, i.e., an operator may only have
a single ancestor.

Algorithm 1 provides pseudocode for enumerating all valid alter- 
natives for a given data flow. The algorithm is based on recursive 
calls to enumerate alternatives for sub-flows and the exchange of 
two neighboring operators. In the listing, data flows and sets of data 
flows are denoted with capitalized names while operators and sets 
of operators have lowercased names. The functions getRoot(D)
and rmRoot(D) return or remove the root of the data flow D, while
addRoot(D, r) appends r as root of D and setRoot(D, r)
replaces D's root with r. For ease of exposition, the algorithm as
shown handles data flows with single-input operators only. How- 
ever, it can be easily extended to deal with non-unary operators, 
and our implementation can, in fact, handle binary operators. 

We discuss the algorithm and argue that it computes all valid
reordered data flows with the help of an example data flow
$D = [\mathrm{Src} \to \mathrm{Map}_1 \to \mathrm{Map}_2 \to \mathrm{Map}_3]$. The flow consists of a data
source Src and three Map operators with $\mathrm{Map}_3$ being the root. We
assume that all Map operator pairs can be reordered except for $\mathrm{Map}_2$
and $\mathrm{Map}_3$. The algorithm starts by recursively enumerating all
reordered alternatives $Alts_{-r}$ for $D_{-r}$, which is the input data flow
$D$ minus the root operator $r$ ($\mathrm{Map}_3$) (Line 18):

$Alts_{-\mathrm{Map}_3} = \text{Enum-Alternatives}([\mathrm{Src} \to \mathrm{Map}_1 \to \mathrm{Map}_2])$
$= \{[\mathrm{Src} \to \mathrm{Map}_1 \to \mathrm{Map}_2], [\mathrm{Src} \to \mathrm{Map}_2 \to \mathrm{Map}_1]\}$

The result of the first recursive call $Alts_{-r}$ is used for two
purposes. First, to enumerate a subset of the result Alts, namely all



Algorithm 1 Enumeration of Alternative Data Flows.
 1: function Enum-Alternatives(D)
 2:   input: data flow D
 3:   output: all possible data flows derived by reordering of D
 4:   Alts = mTab.get(getMTabKey(D))                // check memoTable
 5:   if (Alts ≠ ∅) then
 6:     return Alts
 7:   r = getRoot(D)                                // get root r of D
 8:   if (r is data source) then
 9:     Alts = {r}
10:   else if (r is data sink) then
11:     D_{-r} = rmRoot(D)
12:     Alts_{-r} = Enum-Alternatives(D_{-r})
13:     for (A_{-r} ∈ Alts_{-r}) do                 // add r to each A_{-r}
14:       Alts = Alts ∪ {addRoot(A_{-r}, r)}
15:   else if (r is single-input operator) then
16:     cand = ∅
17:     D_{-r} = rmRoot(D)
18:     Alts_{-r} = Enum-Alternatives(D_{-r})
19:     for (A_{-r} ∈ Alts_{-r}) do
20:       s = getRoot(A_{-r})                       // get candidate root s
21:       Alts = Alts ∪ {addRoot(A_{-r}, r)}        // add r to A_{-r}
22:       if (s ∉ cand ∧ reorderable(r, s)) then
23:         cand = cand ∪ {s}                       // enum candidate s only once
24:         D_{-s} = setRoot(A_{-r}, r)             // replace s by r
25:         Alts_{-s} = Enum-Alternatives(D_{-s})
26:         for (A_{-s} ∈ Alts_{-s}) do             // append s to each A_{-s}
27:           Alts = Alts ∪ {addRoot(A_{-s}, s)}
28:   mTab.put(getMTabKey(D), Alts)
29:   return Alts



reordered flows with the original root $r$. This is done by simply
appending the root $r$ ($\mathrm{Map}_3$) to each computed alternative
$A_{-r} \in Alts_{-r}$ (Line 21):

$Alts = \{[\mathrm{Src} \to \mathrm{Map}_1 \to \mathrm{Map}_2 \to \mathrm{Map}_3]\} \cup \{[\mathrm{Src} \to \mathrm{Map}_2 \to \mathrm{Map}_1 \to \mathrm{Map}_3]\}$

Second, $Alts_{-r}$ is used to retrieve candidate root operators $s$ that
can be reordered with $r$. For each root $s$ of the computed
alternatives $A_{-r} \in Alts_{-r}$, the algorithm checks whether it can be
reordered with the original root $r$ ($\mathrm{Map}_3$) by calling the Boolean
function reorderable(r, s) (Line 22). In our example, this is
only true for $s = \mathrm{Map}_1$ and $r = \mathrm{Map}_3$ since $\mathrm{Map}_3$ and $\mathrm{Map}_2$
cannot be reordered. Therefore, $\mathrm{Map}_3$ replaces $\mathrm{Map}_1$ as root of
$A_{-r} = [\mathrm{Src} \to \mathrm{Map}_2 \to \mathrm{Map}_1]$, i.e., $r$ is pushed down to data
flow $D_{-s}$ (Line 24):

$D_{-\mathrm{Map}_1} = [\mathrm{Src} \to \mathrm{Map}_2 \to \mathrm{Map}_3]$

The successive recursive call Enum-Alternatives($D_{-s}$) enumerates
all valid reorderings for $D_{-s}$ (Line 25):

$Alts_{-\mathrm{Map}_1} = \text{Enum-Alternatives}([\mathrm{Src} \to \mathrm{Map}_2 \to \mathrm{Map}_3])$
$= \{[\mathrm{Src} \to \mathrm{Map}_2 \to \mathrm{Map}_3]\}$

The result set Alts is amended by all valid reorderings that have $s$
as root. This is done by simply appending $s$ to all reordered flows
$A_{-s} \in Alts_{-s}$ (Line 27):

$Alts = Alts \cup \{[\mathrm{Src} \to \mathrm{Map}_2 \to \mathrm{Map}_3 \to \mathrm{Map}_1]\}$

Finally, all computed alternatives Alts are returned:

$Alts = \{[\mathrm{Src} \to \mathrm{Map}_1 \to \mathrm{Map}_2 \to \mathrm{Map}_3],$
$[\mathrm{Src} \to \mathrm{Map}_2 \to \mathrm{Map}_1 \to \mathrm{Map}_3],$
$[\mathrm{Src} \to \mathrm{Map}_2 \to \mathrm{Map}_3 \to \mathrm{Map}_1]\}$

In order to avoid duplicate enumerations, the algorithm may only
descend once into recursion for each distinct root candidate s (Lines
16, 22, 23). The use of a memo table reduces the number of
recursive descents and improves the runtime (Lines 4, 28).
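The walk-through above can be reproduced with a short sketch of the single-input case of Algorithm 1. It makes several simplifying assumptions that are not part of the paper: a data flow is modeled as a tuple of operator names ordered from source to root, the flow itself serves as the memo table key, data sinks are omitted, a data source is never a swap candidate, and `reorderable` is supplied by the caller.

```python
def enum_alternatives(flow, reorderable, memo=None):
    """Enumerate all flows reachable from `flow` by valid pairwise reorderings."""
    if memo is None:
        memo = {}
    if flow in memo:                       # check memo table (Lines 4-6)
        return memo[flow]
    root = flow[-1]
    if len(flow) == 1:                     # root is the data source (Lines 8-9)
        alts = {flow}
    else:                                  # single-input operator (Lines 15-27)
        alts = set()
        cand = set()
        for a in enum_alternatives(flow[:-1], reorderable, memo):
            s = a[-1]                      # candidate root s
            alts.add(a + (root,))          # keep the original root on top
            if len(a) > 1 and s not in cand and reorderable(root, s):
                cand.add(s)                # descend only once per candidate
                pushed = a[:-1] + (root,)  # replace s by the original root
                for a_s in enum_alternatives(pushed, reorderable, memo):
                    alts.add(a_s + (s,))   # append s as the new root
    memo[flow] = alts                      # fill memo table (Line 28)
    return alts

def example_reorderable(r, s):
    """All Map pairs can be swapped except Map2 and Map3, as in the example."""
    return {r, s} != {"Map2", "Map3"}

alts = enum_alternatives(("Src", "Map1", "Map2", "Map3"), example_reorderable)
all_free = enum_alternatives(("Src", "Map1", "Map2", "Map3"), lambda r, s: True)
```

With the restriction on Map2 and Map3, the call yields the three alternatives derived above; without any restriction it yields all six orderings of the three Map operators.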

The enumeration algorithm can also be easily integrated with 
a Volcano-style physical optimizer using interesting properties as 
described in [7,21]. Instead of computing and returning all valid 
reordered data flows, the Enum-Alternatives() function can be
adapted to compute the least expensive physical execution plan for 
each interesting property. Additionally, the algorithm must take 
care that at least one plan for each possible root node s of a sub- 
flow is returned, in order to enumerate all possible reorderings. 
Physical execution plans are generated by recursively computing 
the least expensive execution plans for sub-flows, choosing local 
and shipping strategies only for the root node, and connecting it 
to the sub-plan. Interesting properties can be tracked during recur- 
sive descent and be used to enumerate physical execution plans for 
sub-flows. By integrating cost-based physical optimization in the
enumeration algorithm, the principle of optimality can be exploited,
which effectively reduces the number of enumerated alternatives.

In contrast to optimization of relational queries, our approach 
for enumerating reordered data flows is limited by the choice of 
the initial data flow. For some queries, such as queries that include 
circular join graphs, the initial data flow already implies a plan de- 
cision that cannot be changed by reordering operators. 



7. EVALUATION 

We implemented a prototype to evaluate our approach for data 
flow optimization. The prototype is based on a pre-release snapshot 
of the next version of the Stratosphere system which is available as 
open source [4]. Furthermore, we implemented data processing 
tasks from different domains as PACT programs to experimentally 
evaluate and validate our approach. The domains include relational 
OLAP, as well as weblog clickstream processing and biomedical 
text mining. Our experimental evaluation covers the following as- 
pects. First, we assess the optimization potential for parallel data 
flows. Second, we evaluate the plan space enumerated by our op- 
timizer. Third, we discuss the overhead of the plan enumeration 
algorithm. Finally, we verify that static code analysis can be used 
to derive the necessary properties for reordering UDFs. 

We first discuss our prototype implementation and present
the PACT programs used for evaluation before we show and discuss
experimental results.

7.1 Experimental Setup 

The existing optimizer of Stratosphere performs cost-based physical
optimization as known from parallel relational optimizers, i.e.,
it selects data shipping and execution strategies such as broadcasting
and hybrid-hash joins for a given data flow [7]. The cost model
is a combination of network IO, disk IO, and CPU costs of UDF 
calls. For result size and cost estimations, the optimizer relies on 
hints such as "Average Number of Records Emitted per UDF Call", 
"CPU Cost per UDF Call", and "Number of Distinct Values per 
Key-Set". These can be provided by the user, a language compiler 
(e.g., Hive or Pig), or obtained by runtime profiling.

In order to implement our prototype, we adapted the optimiza- 
tion process of Stratosphere's optimizer in the following ways. Prior 
to plan enumeration, the optimizer obtains information about the 
UDFs which is required to reason about reorderability of operators. 
This information can be provided by manually attached annotations 
or derived by an SCA component. Our SCA component is based on 
the Soot framework [3], which provides all features required by our 
code analysis technique (see Section 5). It also takes care of
establishing the global record. After the information has been
obtained, all valid alternative data flows are computed using the
enumeration algorithm presented in Section 6. The existing cost-based
optimizer [7] is called for each alternative to choose shipping and 
local strategies and compute a cost estimate. Finally, the cheapest 
plan is selected and returned for execution. 

We perform our experiments on a cluster of four machines, each
equipped with two Intel Xeon E5530 quad-core CPUs, 48
GB RAM, and ten 250 GB disks for data bundled in a RAID5.
The machines are connected with 1 GBit Ethernet and run Linux 
(Ubuntu Server 10.04.3 LTS), Sun Java 6, and HDFS 0.20.2. We 
execute all tasks with a degree of parallelization of 32. 

7.2 Evaluation Programs 

We evaluate our approach using four tasks from different domains.
Algebraic optimization of relational queries is best known
from relational DBMSs but is also applied in the context of parallel
data flow systems by higher-level languages such as Hive [31],
SCOPE [12], and Tenzing [16]. In order to show the effectiveness
of our approach, we implemented two queries of the TPC-H benchmark
for evaluation. Parallel data flow engines are commonly used
for non-relational tasks. We show the applicability of data flow
optimization for such domains by providing two non-relational tasks,
namely biomedical text mining and weblog clickstream processing.
All four tasks are implemented as handcrafted PACT data flows. In
this section, we briefly present all tasks and their implementations.

Relational OLAP: We implemented slightly modified variants of
queries 7 (where we reduced the selectivity of the shipdate filter
and removed the final sorting) and 15 (where we removed the filter
on total_revenue) from the TPC-H benchmark to cover relational
analytical tasks. For our experiments, we run both queries on
a 400 GB TPC-H data set. Query 7 applies a local predicate on the
lineitem relation, joins six relations with circular-connected join
predicates, and finally performs a grouping with sum aggregation.
Figure 2(a) shows our PACT implementation. All joins are implemented
as Match operators except the join with the disjunctive join
predicate (nation1 ⋈ nation2), which is implemented as a filtering
Map operator. Grouping and sum aggregation are done by a
Reduce operator.

Figure 2: PACT data flows of Query 7: (a) Implemented data
flow, (b) 1st ranked reordered data flow.

Our query 15 applies a local predicate on the lineitem relation, 
joins it with the supplier table, and groups and aggregates to com- 
pute the final result. We implemented the local predicates as a Map, 
the join as a Match, and the grouping and aggregation as a Reduce
operator (see Figure 3(a)). Note that the join predicate and the
grouping clause reference the same attribute (s_key). As shown 
in Section 4.3.2, this is a necessary condition to reorder a Match 
and a Reduce operator. 

Figure 3: PACT data flows of Query 15: (a) Implemented data
flow, (b) Data flow with Match before Reduce.

Text Mining: We implemented a text mining task that detects re- 
lationships between genes and drugs described in biomedical text 
corpora. The data flow is a pipeline of Map operators which ex- 
tract entities and relationships by applying several natural language 
processing (NLP) algorithms to the input. Our program takes a 
text collection as input and performs some linguistic preprocessing,
e.g., tokenization and part-of-speech tagging on the input, to
enable entity and relation extraction. In order to reduce interme- 
diate result set sizes, each entity or relation extraction component 
also works as a filter by forwarding only those records that actu- 
ally contain a gene, drug, or relation mention. Most NLP compo- 
nents are very compute-intensive since they often call third-party 
machine-learning or automaton-based components to enable the 
extraction process. Furthermore, most components have dependen- 
cies on other components to be executed in advance. These depen- 
dencies limit the set of valid reordered data flows. Optimization 
potential arises from different filter selectivities and varying exe- 
cution costs for the text mining components. We execute the text 
mining data flow on a 425 MB subset of the PubMed data set. 

Clickstream Processing: Weblog processing is a common exam- 
ple of non-relational data flows [20]. We implemented a task that 
processes web shop clickstream data (see Figure 4(a)). The task 
extracts click sessions that lead to buy actions and augments them 
with detailed user information. Such tasks are common preprocess- 
ing steps for data mining algorithms. In our scenario, a clickstream 
entry contains an IP address, a timestamp, and a visited URL. The 
URL encodes a session id, and the performed user action. The first 
Reduce operator filters on sessions that include at least one buy ac- 
tion. The successive Reduce operator condenses a session into a 
single record. The following Match operator joins by session id
with a relation that resolves session ids to ids of logged-in users,
thereby selecting only sessions with logged-in users. Finally, a
second Match operator appends detailed user information by joining
on the user id. For our experiments, we ran the task on 430 GB
click, 13.8 GB login, and 9.2 GB user_info data.

7.3 Experiments

Optimization Potential: Query optimization as performed by mod- 
ern relational DBMSs has the potential to improve query execution 
time by orders of magnitude. Our first set of experiments assesses 
the potential of our generic data flow optimization technique. We 
enumerate all possible data flows for a given PACT program. Each 
reordered alternative is fed into the physical optimizer where ship- 
ping and local execution strategies are enumerated, and cost esti- 
mates are obtained. We sort the resulting plans in ascending order 
by their estimated costs and assign a rank to each plan that corre- 
sponds to its position in the list. We pick ten plans in regular rank 
intervals from the list and execute them. For each executed plan, we 
plot the cost estimate of the optimizer and the actual runtime (av- 
eraged over three runs), both normalized by the lowest estimated 
costs and averaged runtime respectively. 

Figure 5 shows the results for TPC-H query 7. The enumeration
algorithm explored a space of 2518 alternative plans. We see
that the plan with the lowest estimated cost also has the lowest
execution time, with an absolute runtime of roughly 6 minutes (see
Figure 2(b) for this plan). The last ranked plan is slower by a factor
of 7 and requires the most time for execution (about 45 minutes).

Figure 5: Normalized cost estimates and execution runtime for
10 regularly picked execution plans of the TPC-H query 7.

Figure 6 shows the estimated costs and runtimes for selected 
plans of the text mining task. The best plan (according to estimated 
costs) outperforms the worst by almost an order of magnitude.

Figure 4: PACT data flows of clickstream processing task: (a)
Implemented data flow, (b) 1st ranked reordered data flow.



Figure 6: Normalized cost estimates and execution runtime for 
10 regularly picked execution plans of the text mining job. 

Our experiments show that reordering of data flows can lead to
significant performance improvements. Since execution plans with
higher cost estimates generally require more time for execution,
the results also support the validity of the optimizer's cost model.
We note that Stratosphere does not yet support indexes, columnar
layouts, or materialized views. Therefore, all
execution plans result in full scans of all included data sets, which
limits the achievable runtime improvements.

Plan Enumeration Space: We continue discussing the plan enu- 
meration space with TPC-H query 15. Our implementation is based 
on a Map, a Reduce, and a Match operator (see Figure 3). We can 
exchange Match and Reduce since the ROC condition is fulfilled, 
Match preserves the group cardinality because it is a PK-FK join, 
and Reduce groups on the match key (s_key). This is essentially
an aggregation push-up rewrite that could also be applied by a
relational optimizer. Besides the changed order of Reduce and Match,
the rewrite also leads to different physical plan choices. 

For the data flow with Reduce being the input of Match (Fig- 
ure 3(a)), the physical optimizer chooses to partition the input of 
Reduce and establish the groups by sorting. The grouped and ag- 
gregated result is locally forwarded into the Match operator and 
used to build a hash table. Since Match operates on the same key 
as Reduce, the partitioning property remains and can be reused. To 
compute the final result, the supplier relation is also partitioned, 
shipped to the Match operator, and probed against the hash table. 
In fact, the optimizer could also choose to reuse the sorting of
Reduce and perform a sort-merge join for Match. However, this would
require sorting the supplier relation.

The alternative data flow with Match being the input of Reduce 
(Figure 3(b)) is executed using a different shipping strategy. In this 
case, Match's lineitem input is much larger than the supplier input, 
since it has not been aggregated as in the previous case. Therefore, 
the optimizer decides to broadcast the much smaller supplier input 
to all parallel instances of Match and build a hash table from it. 
The lineitem side is locally forwarded and probed against the hash 
table. The result is partitioned and shipped to Reduce which groups 
by sorting and computes the final result. 
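The choice between the two shipping strategies can be illustrated with a toy cost comparison. The model is an assumption made for this sketch and is not Stratosphere's cost model: cost is counted in records sent over the network, a broadcast sends the smaller input to every parallel instance, and a repartitioning sends each input once. The input sizes and the degree of parallelism are illustrative values.

```python
def broadcast_cost(small_input: int, parallelism: int) -> int:
    """Records shipped when the smaller input is copied to every instance."""
    return small_input * parallelism


def repartition_cost(left_input: int, right_input: int) -> int:
    """Records shipped when both inputs are hash-partitioned on the join key."""
    return left_input + right_input


def choose_ship_strategy(left_input: int, right_input: int, parallelism: int) -> str:
    """Pick the strategy that ships fewer records under the toy model."""
    b = broadcast_cost(min(left_input, right_input), parallelism)
    r = repartition_cost(left_input, right_input)
    return "broadcast" if b < r else "repartition"


# Match before Reduce: a large, unaggregated input meets a small one.
print(choose_ship_strategy(6_000_000, 10_000, parallelism=8))
# Reduce before Match: the aggregated input is about as small as the other.
print(choose_ship_strategy(10_000, 10_000, parallelism=8))
```

Under these assumptions broadcasting only pays off while the small input, multiplied by the number of instances, stays below the combined size of both inputs. The sketch ignores that an existing partitioning can be reused, as in the first plan described above.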




Figure 7: Normalized cost estimates and execution runtime for
all 4 execution plans of the clickstream processing job.

As previously stated, our optimizer is also able to reorder non- 
relational operators. Figure 4 shows (a) the implemented PACT 
program and (b) the data flow chosen by the optimizer for the click- 
stream processing task. Both Reduce operators are non-relational 
operators. The "Filter Buy Sessions" UDF is called with all click 
records of a session and checks whether at least one click performs 
a buy action. In that case, all click records are forwarded, other- 
wise none. The subsequent "Condense Sessions" UDF collects all 
clicks of a session, merges them into a single record and forwards 
it. Comparing the best-performing data flow with the implemented
one, we see that the optimizer pushed the selective join ("Filter
Logged-In Sessions") below both non-relational Reduce operations. We
are not aware of a data processing system that is able to perform
similar optimizations. Figure 7 gives the estimated costs and
runtimes of all four execution plans. The best-performing plan beats
the implemented data flow (Rank 3) by a factor of 1.4, or 13:47
minutes.
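To make the two non-relational Reduce UDFs concrete, the following sketch mimics their described behavior and compares the implemented operator order with the reordered one on a tiny input. The record layout (`session`, `action`, `url`), the helper `run_reduce`, and the modelling of the join as a membership test on a set of logged-in sessions are assumptions for illustration; the actual PACT program is not shown in the text.

```python
from itertools import groupby


def filter_buy_sessions(clicks):
    """Forward all clicks of a session if at least one is a buy, else none."""
    return clicks if any(c["action"] == "buy" for c in clicks) else []


def condense_sessions(clicks):
    """Merge all clicks of a session into a single record."""
    return [{"session": clicks[0]["session"], "urls": [c["url"] for c in clicks]}]


def run_reduce(udf, records):
    """Call the UDF once per session group (stable sort keeps click order)."""
    records = sorted(records, key=lambda r: r["session"])
    out = []
    for _, group in groupby(records, key=lambda r: r["session"]):
        out.extend(udf(list(group)))
    return out


def filter_logged_in(records, logged_in):
    """Stand-in for the selective join: keep records of logged-in sessions."""
    return [r for r in records if r["session"] in logged_in]


clicks = [
    {"session": 1, "action": "view", "url": "/a"},
    {"session": 1, "action": "buy", "url": "/cart"},
    {"session": 2, "action": "view", "url": "/b"},
    {"session": 3, "action": "buy", "url": "/cart"},
]
logged_in = {1, 2}

# Implemented order: both Reduce operators first, the join last.
implemented = filter_logged_in(
    run_reduce(condense_sessions, run_reduce(filter_buy_sessions, clicks)), logged_in)
# Reordered: the selective join runs before both Reduce operators.
reordered = run_reduce(
    condense_sessions,
    run_reduce(filter_buy_sessions, filter_logged_in(clicks, logged_in)))

print(implemented == reordered)
```

In this sketch the join keeps or drops whole sessions, so every session group reaches the Reduce UDFs either complete or not at all, and both orders produce the same result while the reordered one processes fewer records.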

Our optimizer explores large fractions of the search space that 
conventional relational optimizers cover, including bushy join or- 
ders (Figure 2), pushed aggregations (Figure 3), and reasoning about 
interesting properties [7]. Furthermore, we show that our approach 
enables optimizations that are not supported by any current data 
analysis system we are aware of (Figure 4). 

PACT Task     Enumerated Orders with    Enumerated Orders
              Manual Annotation         with SCA
-----------   ----------------------    -----------------
Clickstream   4                         3 (75%)
TPC-H Q7      2518                      2518 (100%)
TPC-H Q15     4                         4 (100%)
Text Mining   24                        24 (100%)

Table 1: Comparing number of reordered alternatives for manually
annotated and automatically derived read and write sets.

Enumeration Time: Our enumeration algorithm faces the same problem
of exponential search space size as relational optimizers. As
previously discussed, our prototypical implementation first
enumerates all valid reordered data flows and subsequently calls the
physical optimizer for each candidate. This implementation does not
permit cost-based search space pruning and is not tailored towards
efficient plan enumeration. In Section 6 we gave an intuition of how
the enumeration algorithm could be integrated with physical
optimization. An important part of our future research is to
leverage well-known search space pruning techniques and benchmark
the overhead of our query optimizer. For all queries presented so
far, which represent typical data analysis tasks, plan enumeration
took less than 1654 ms using our naive implementation. The overhead
of performing the static code analysis is virtually zero.
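The enumerate-then-optimize scheme of the prototype can be sketched for the simplest possible case. The sketch makes strong simplifying assumptions that do not hold for the actual algorithm: the data flow is a linear chain of unary operators, validity is decided by a purely pairwise reordering relation, and a made-up cost function stands in for the physical optimizer. All operator names, selectivities, and sizes are hypothetical.

```python
from itertools import permutations


def enumerate_orders(chain, reorderable):
    """Return every permutation in which each inverted pair may be swapped."""
    position = {op: i for i, op in enumerate(chain)}
    valid = []
    for candidate in permutations(chain):
        ok = True
        for i, a in enumerate(candidate):
            for b in candidate[i + 1:]:
                # a now precedes b although it followed b in the original chain
                if position[a] > position[b] and frozenset((a, b)) not in reorderable:
                    ok = False
        if ok:
            valid.append(candidate)
    return valid


def estimated_cost(order, selectivity, input_size):
    """Stand-in cost: total number of records processed by all operators."""
    cost, size = 0.0, float(input_size)
    for op in order:
        cost += size
        size *= selectivity[op]
    return cost


chain = ("parse", "condense", "filter")
reorderable = {frozenset(("condense", "filter"))}
selectivity = {"parse": 1.0, "condense": 0.5, "filter": 0.1}

orders = enumerate_orders(chain, reorderable)
best = min(orders, key=lambda o: estimated_cost(o, selectivity, 1000))
print(len(orders), best)
```

Because every permutation is generated before any cost is computed, the work grows factorially with the chain length, which illustrates why cost-based pruning during enumeration matters for larger data flows.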

Feasibility of Static Code Analysis: We evaluate the feasibility of 
static code analysis to determine read and write sets of UDFs. For 
this purpose, we compare the number of reordered alternative data 
flows that were enumerated based on read and write sets which 
were manually annotated and automatically derived using static 
code analysis. Table 1 gives the results for all presented evaluation 
tasks. The information extracted by our prototypical implementa- 
tion of the SCA component enables the optimizer to enumerate al- 
most all valid plans for our four evaluation data flows. The current 
implementation is restricted to information that is available at UDF 
compile time and can be easily accessed such as field accesses with 
literals and final variables. This can be extended to more exhaus- 
tive control flow tracking and incorporation of job configurations 
which are only available at optimization time. 
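The kind of information such an analysis extracts can be illustrated with a small stand-in that inspects Python source using the standard `ast` module. It is not the SCA component described here. The record API (`get_field`, `set_field`) and the example UDF are hypothetical, and module-level integer constants, assumed never to be reassigned, play the role of final variables. Field indexes that cannot be resolved statically are not guessed; the result is flagged as incomplete so that a caller can stay conservative.

```python
import ast

UDF_SOURCE = '''
PRICE = 1

def map_udf(rec):
    if rec.get_field(0) > 10:
        rec.set_field(2, rec.get_field(PRICE) * 2)
    return rec
'''


def read_write_sets(source: str):
    """Return (reads, writes, complete) for get_field/set_field calls."""
    tree = ast.parse(source)

    # Collect module-level NAME = <int literal> assignments as constants.
    consts = {}
    for node in tree.body:
        if (isinstance(node, ast.Assign) and len(node.targets) == 1
                and isinstance(node.targets[0], ast.Name)
                and isinstance(node.value, ast.Constant)
                and isinstance(node.value.value, int)):
            consts[node.targets[0].id] = node.value.value

    reads, writes, complete = set(), set(), True
    for node in ast.walk(tree):
        if not (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr in ("get_field", "set_field")
                and node.args):
            continue
        arg = node.args[0]
        if isinstance(arg, ast.Constant) and isinstance(arg.value, int):
            index = arg.value                  # literal field index
        elif isinstance(arg, ast.Name) and arg.id in consts:
            index = consts[arg.id]             # index given by a constant
        else:
            complete = False                   # computed index: unknown field
            continue
        (reads if node.func.attr == "get_field" else writes).add(index)
    return reads, writes, complete


print(read_write_sets(UDF_SOURCE))
```

A UDF that computes its field index at run time, for example `rec.get_field(i + 1)`, yields `complete == False`. This corresponds to the cases that the text suggests handling with more exhaustive control flow tracking.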

8. RELATED WORK 

We are not aware of any work that aims to optimize data flows 
with unknown operator semantics by reordering. The work most 
relevant to ours is Manimal [26]. Manimal applies static code
analysis to MapReduce programs to identify relational-style
selections and projections. An optimizer selects from available
B+-tree indexes and decides on the use of delta-compression.
Manimal's optimizations are orthogonal to ours (operator
reordering), and would constitute valuable additions to our system.

Optimization of user-defined predicates was discussed in the con- 
text of extensible RDBMSs [15,23]. This line of work only con- 
sidered UDFs with the semantics of relational selection operators. 
Therefore, the challenge was not to identify when reordering is pos- 
sible, but when it is beneficial. 

A number of recent approaches consider optimization in the con- 
text of translating a higher-level algebraic specification to a data 
flow. Higher-level specifications include AQL [9], Pig [28], Jaql 
[10], Hive [31], Tenzing [16], DryadLINQ [33], and SCOPE [12]. 
The target parallel data flow platforms include MapReduce [17], 
Dryad [25], and Hyracks [11]. In contrast to these approaches, our
work applies optimization directly to data flows without knowledge
of the operators' algebraic properties. We view the two approaches
as complementary: while we show that some optimizations can be
done at the data flow level, thus making a data flow engine able to
seamlessly handle multiple data and programming models, other
optimizations are semantic in nature and can only be done at a
higher level. We note that higher-level language translators can
enrich data flows with reordering information based on the
operators' semantics, hence enabling the unified optimization of
operator order and physical optimization at the data flow
abstraction.

The Starfish project applies cost-based optimization to MapReduce
programs [24]. In contrast to our work, Starfish neither inspects
nor optimizes the program itself. Instead, it uses runtime profiling
and cost-based optimization to generate well-performing job
configurations for Hadoop MapReduce jobs.

Finally, we draw inspiration from the Ferry project [22]. Ferry 
follows an algebraic approach to push data processing instructions 
from the application into the DBMS by translating general-purpose 
(application) code to SQL queries. 

9. CONCLUSIONS AND FUTURE WORK 

We propose and address the problem of optimizing data flows 
that consist of black box user-defined functions written in an imper- 
ative language. In this setting, the algebraic properties of the oper- 
ators of the data flow are unknown, and must be discovered. Our 
key insight is that a handful of properties, which can be discovered 
using static code analysis, suffice to establish many optimizations 
known from relational algebra, including filter and join reordering, 
and some forms of aggregation push-down. We formally establish 
reordering conditions, show how to estimate the desired properties 
via static code analysis, and present a plan enumeration algorithm. 

We have prototyped our solution in the Stratosphere system. Our 
experimental results show that our approach is able to reorder re- 
lational and non-relational data flows, leading to runtime improve- 
ments of up to an order of magnitude. Moreover, we demonstrate 
that our approach is able to perform optimizations which algebraic 
optimizers are not capable of. Our experiments attest that our
static code analyzer successfully extracts the properties of black
box UDFs that are required for reordering them.

We identify several avenues for future research. While this work 
explores which reorderings of data flows are possible, we plan to 
identify which reorderings are beneficial. This will include esti- 
mating the selectivity and execution cost of black box operators. 
Furthermore, we plan to investigate a wider range of optimizations 
including managing attribute projections globally in a plan, opti- 
mizations that take into account some semantic information of op- 
erators, and intrusive optimizations that change the code of the user 
functions. The latter could include dissecting an operator into
independent components that can then be reordered individually. We
plan to exploit the state of the art of semantic program analysis to
gain more information about the internals of the operator UDFs.